© 2014 MapR Technologies
An Overview of Apache Spark
Agenda
• Before Spark
• What is Spark?
• Why is it cool?
• Spark and Storm
• What are the Issues?
MapReduce Processing Model
• Define mappers
• Shuffling is automatic
• Define reducers
• For complex work, chain jobs together
– Or use a higher level language or DSL that does this for you
Enter Apache Spark
spark.apache.org
github.com/apache/spark
user@spark.apache.org
• Originally developed in 2009 in UC Berkeley’s AMP Lab
• Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation
The Spark Community
Unified Platform
• Spark (general execution engine)
• Libraries on top: Spark SQL (SQL query), Spark Streaming (streaming), MLlib (machine learning), GraphX (graph computation)
• Resource management: YARN / Mesos / Standalone
• Data sources: HDFS, Amazon S3, …
• Languages: Java, Scala, Python
The Execution Engine
• MapReduce’s heir apparent
• ‘nuff said
Spark SQL
• SQL query over your data
• Automatic schema discovery for JSON
• RDDs from SQL queries
• If you have Hive, you can try out Spark SQL very easily
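The whole workflow is only a few lines. A minimal sketch, assuming the Spark 1.x SQLContext API; the file path and the people/age schema are hypothetical:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
// Schema is discovered automatically from the JSON documents
val people = sqlContext.jsonFile("hdfs://.../people.json")
people.registerTempTable("people")
// A SQL query returns an RDD of Rows, so it composes with the core API
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)
```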
Spark Streaming
• Micro-batching of a potentially unbounded stream
• Batch operations applied to a stream of data
• Streaming introduces some additional APIs, but they remain very consistent with the batch API
– This promotes reuse of code
MLlib - Machine Learning
• K-Means
• L1- and L2-regularized Linear Regression
• L1- and L2-regularized Logistic Regression
• Alternating Least Squares
• Naive Bayes
• Stochastic Gradient Descent
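As one concrete example, k-means clustering fits in a few lines. A sketch against the Spark 1.x MLlib API; the input path and format are hypothetical:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Assume each line of the input is a space-separated numeric vector
val points = sc.textFile("hdfs://.../points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()   // k-means is iterative, so keep the input in memory

val model = KMeans.train(points, 2, 20)   // k = 2 clusters, 20 iterations
model.clusterCenters.foreach(println)
```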
GraphX - Graph Computation on Spark
• Solve graph problems on Spark
– PageRank
– Social network analysis
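PageRank over an edge list, for instance, is a one-liner once the graph is loaded. A sketch assuming GraphX’s GraphLoader and a hypothetical "srcId dstId" edge file:

```scala
import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "hdfs://.../followers.txt")
// Iterate until the ranks change by less than the given tolerance
val ranks = graph.pageRank(0.0001).vertices
ranks.sortBy(_._2, ascending = false).take(10).foreach(println)
```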
So What?
• Spark is a complete data analysis stack.
• People who are starting now can build on Spark and not miss out*
• It’s easy to get started, start small and get value quickly.
* Hadoop’s ecosystem and M/R are arguably more stable and proven, but this is changing fast.
Why is Spark Cool?
Easy and Fast Big Data
• Easy to Develop
– Rich APIs in Java, Scala, Python
– Interactive shell
• Fast to Run
– General execution graphs
– In-memory storage
• 2-5× less code
• Up to 10× faster on disk, 100× in memory
Easy to Develop - The REPL
• Iterative Development
– Cache those RDDs
– Open the shell and ask questions
– Compile / save your code for scheduled jobs later
• spark-shell
• pyspark
$ bin/spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
scala>
Easy to Develop - A small API
• The canonical example - wordcount
– Short! Even in Java!
JavaSparkContext sc = new JavaSparkContext(master, appName, [sparkHome], [jars]);
JavaRDD<String> file = sc.textFile("hdfs://...");
JavaPairRDD<String, Integer> counts = file.flatMap(line -> Arrays.asList(line.split(" ")))
    .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
    .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://...");
Fast to Run - RDDs
• Spark’s fundamental abstraction is the RDD
• “Fault-tolerant collection of elements that can be operated on in
parallel”
– Parallelized Collection: Scala collection, parallelized
– Data pulled from some parallelization-aware data source, such as HDFS
• Transformations
– Creation of a new dataset from an existing one
• map, filter, distinct, union, sample, groupByKey, join, etc…
• Actions
– Return a value after running a computation
• collect, count, first, takeSample, foreach, etc…
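The distinction matters because transformations are lazy – nothing runs until an action forces evaluation. A small shell illustration:

```scala
val nums = sc.parallelize(1 to 1000)    // parallelized Scala collection
val evens = nums.filter(_ % 2 == 0)     // transformation: builds a new RDD, runs nothing
val squares = evens.map(n => n * n)     // another transformation, still lazy
squares.count()                         // action: triggers the computation, returns 500
squares.take(3)                         // action: Array(4, 16, 36)
```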
Fast to Run - RDD Persistence, Caching
• Variety of storage levels
– MEMORY_ONLY (default), MEMORY_AND_DISK, etc…
• API Calls
– persist(StorageLevel)
– cache() – shorthand for persist(StorageLevel.MEMORY_ONLY)
• Considerations
– Read from disk vs. recompute (MEMORY_AND_DISK)
– Total memory storage size (MEMORY_ONLY_SER)
– Replicate to a second node for faster fault recovery (MEMORY_ONLY_2)
• Think about this option if supporting a web application
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
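In code, persistence is a single call on the RDD; the first action materializes it and later actions reuse it. A sketch (the log path and filter strings are hypothetical):

```scala
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://.../logs")
val errors = logs.filter(_.contains("ERROR"))

errors.cache()   // shorthand for persist(StorageLevel.MEMORY_ONLY)
// or pick a level explicitly, e.g. spill to disk instead of recomputing:
// errors.persist(StorageLevel.MEMORY_AND_DISK)

errors.count()                                // first action populates the cache
errors.filter(_.contains("timeout")).count()  // reuses the cached partitions
```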
RDD Fault Recovery
RDDs track lineage information that can be used to efficiently
recompute lost data
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])
HDFS File → filter (func = startsWith(…)) → Filtered RDD → map (func = split(…)) → Mapped RDD
Spark Streaming - DStreams
• An infinite stream of data, broken up into finite RDDs
• Because Streaming uses the same abstraction (RDDs), it is
possible to re-use code. This is pretty cool.
• Not ideal for the lowest latency applications. You need to be ok
with >1s.
• It is ideal for ML, and problems that benefit from iteration.
Comparing Spark Streaming and Storm
• Spark Streaming
• Less mature
• DStreams of RDDs
• >1s latency
• Micro-batched
• Stateful, Exactly-once
– Spark handles failure
automatically
• Scala, Java (no Python)
• Storm
• More mature
• Tuples
• Lowest latency
• Event-based (tuple at a time)
• Stateless (w/o Trident)
– Without Trident, Storm does not handle failure automatically
• Java, Clojure, Python, Ruby
Spark Streaming and Storm Trident are more directly comparable
How to Choose one over the other?
• Spark Streaming
• Buzzword compliant!
• High-level, clean APIs
• Code reuse across batch
and streaming use cases
• Storm/Trident
• Production stable
• More language options
• Supports at-least-once, exactly-once, and at-most-once reliability
Production in 3 weeks? Storm. Production in 3-6 months? Spark?
Why Are People Loving on Spark?
• The REPL makes for really easy data exploration.
• Batch and streaming, using the same API.
• Spark APIs are not difficult to pick up
• It’s possible to get a lot done in far less code than MapReduce
• Spark leverages existing technologies for data ingest and resource
management, making it easy to adopt alongside Hadoop.
What are the Issues?
• Scaling. Anecdotally, small clusters with larger machines do
better than large clusters with smaller machines. (though Yahoo
has 1200 nodes)
• Failure modes exist where data can be lost. Use a highly reliable
storage backend so you can replay things. Kafka is a good idea.
This is documented.
• Lots of this comes down to Spark’s relative immaturity, and that
will change.
Some Examples
JavaSparkContext sc = new JavaSparkContext(master, appName, [sparkHome], [jars]);
JavaRDD<String> file = sc.textFile("hdfs://...");
JavaPairRDD<String, Integer> counts = file.flatMap(line -> Arrays.asList(line.split(" ")))
    .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
    .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://...");
val sc = new SparkContext(master, appName, [sparkHome], [jars])
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Word Count
• Java MapReduce (~15 lines of code)
• Java Spark (~ 7 lines of code)
• Scala and Python (4 lines of code)
– interactive shell: skip line 1 and replace the last line with counts.collect()
• Java8 (4 lines of code)
Spark Streaming - Network Word Count
// Create the context with a 1 second batch size
val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1),
System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass))
// Create a NetworkInputDStream on target host:port and count the
// words in the input stream of \n delimited text (e.g. generated by 'nc')
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
Thanks!
@mapr / maprtech / mapr-technologies
vgonzalez@mapr.com
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 

Cleveland Hadoop Users Group - Spark

  • 10. MLlib - Machine Learning
      • K-Means
      • L1- and L2-regularized Linear Regression
      • L1- and L2-regularized Logistic Regression
      • Alternating Least Squares
      • Naive Bayes
      • Stochastic Gradient Descent
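Stochastic gradient descent, the last item in the list above, is simple enough to sketch in a few lines of plain Python - no MLlib involved, and the data, learning rate and epoch count here are invented purely for the illustration:

```python
import random

def sgd_linear_regression(points, lr=0.05, epochs=200, seed=0):
    """Fit y = w*x + b by stochastic gradient descent, one sample at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(points)                # visit samples in random order each epoch
        for x, y in points:
            err = (w * x + b) - y          # prediction error for this sample
            w -= lr * err * x              # gradient of squared error w.r.t. w
            b -= lr * err                  # gradient of squared error w.r.t. b
    return w, b

# Noise-free points on the line y = 2x + 1; SGD should recover w≈2, b≈1
data = [(x, 2 * x + 1) for x in [0, 1, 2, 3, 4]]
w, b = sgd_linear_regression(list(data))
```

MLlib runs the same kind of update distributed across an RDD of samples; this sketch only shows the per-sample arithmetic.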
  • 11. GraphX - Graph Computation on Spark
      • Solve graph problems on Spark
        – PageRank
        – Social
  • 12. So What?
      • Spark is a complete data analysis stack.
      • People who are starting now can build on Spark and not miss out*
      • It’s easy to get started, start small and get value quickly.
      * Hadoop’s ecosystem and MapReduce are arguably more stable and proven, but this is changing fast.
  • 13. Why is Spark Cool?
  • 14. Easy and Fast Big Data
      • Easy to Develop
        – Rich APIs in Java, Scala, Python
        – Interactive shell
      • Fast to Run
        – General execution graphs
        – In-memory storage
      2-5× less code; up to 10× faster on disk, 100× in memory
  • 15. Easy to Develop - The REPL
      • Iterative Development
        – Cache those RDDs
        – Open the shell and ask questions
        – Compile / save your code for scheduled jobs later
      • spark-shell
      • pyspark

      $ bin/spark-shell
      Spark assembly has been built with Hive, including Datanucleus jars on classpath
      Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
            /_/

      Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_60)
      Type in expressions to have them evaluated.
      Type :help for more information.
      Spark context available as sc.

      scala>
  • 16. Easy to Develop - A small API
      • The canonical example - wordcount
        – Short! Even in Java!

      JavaSparkContext sc = new JavaSparkContext(master, appName, [sparkHome], [jars]);
      JavaRDD<String> file = sc.textFile("hdfs://...");
      JavaPairRDD<String, Integer> counts = file
          .flatMap(line -> Arrays.asList(line.split(" ")))
          .mapToPair(w -> new Tuple2<>(w, 1))
          .reduceByKey((x, y) -> x + y);
      counts.saveAsTextFile("hdfs://...");
  • 17. Fast to Run - RDDs
      • Spark’s fundamental abstraction is the RDD
      • “Fault-tolerant collection of elements that can be operated on in parallel”
        – Parallelized Collection: a Scala collection, parallelized
        – Data pulled from some parallelization-aware data source, such as HDFS
      • Transformations - create a new dataset from an existing one
        – map, filter, distinct, union, sample, groupByKey, join, etc.
      • Actions - return a value after running a computation
        – collect, count, first, takeSample, foreach, etc.
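The transformation/action split can be mimicked in plain Python with a hypothetical MiniRDD class (an analogy only, not Spark's API): transformations merely record themselves, and only an action such as collect() triggers evaluation:

```python
class MiniRDD:
    """Toy stand-in for an RDD: transformations are recorded, not run."""

    def __init__(self, data, lineage=None):
        self.data = data                  # the base collection
        self.lineage = lineage or []      # list of (op, func) pairs, applied lazily

    def map(self, f):                     # transformation: returns a new MiniRDD
        return MiniRDD(self.data, self.lineage + [("map", f)])

    def filter(self, f):                  # transformation: returns a new MiniRDD
        return MiniRDD(self.data, self.lineage + [("filter", f)])

    def collect(self):                    # action: evaluates the whole lineage now
        out = list(self.data)
        for op, f in self.lineage:
            out = [f(x) for x in out] if op == "map" else [x for x in out if f(x)]
        return out

nums = MiniRDD(range(10))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)  # nothing computed yet
result = evens_squared.collect()          # the action forces evaluation
```

The real Spark engine uses the same recorded graph to pipeline operators and skip intermediate I/O; this toy just shows the laziness.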
  • 18. Fast to Run - RDD Persistence, Caching
      • Variety of storage levels
        – memory_only (default), memory_and_disk, etc.
      • API Calls
        – persist(StorageLevel)
        – cache() - shorthand for persist(StorageLevel.MEMORY_ONLY)
      • Considerations
        – Read from disk vs. recompute (memory_and_disk)
        – Total memory storage size (memory_only_ser)
        – Replicate to a second node for faster fault recovery (memory_only_2)
          • Think about this option if supporting a web application
      http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
  • 19. RDD Fault Recovery
      RDDs track lineage information that can be used to efficiently recompute lost data.

      msgs = (textFile.filter(lambda s: s.startswith("ERROR"))
                      .map(lambda s: s.split("\t")[2]))

      HDFS File --filter(func = startswith(...))--> Filtered RDD --map(func = split(...))--> Mapped RDD
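The recovery idea in miniature, in plain Python (an analogy, not Spark internals): if a derived partition is lost, replaying the recorded transformations over the surviving base data rebuilds it exactly, so nothing needs to be replicated up front:

```python
# Base data (stands in for an HDFS partition) and its recorded lineage.
base = ["ERROR\tdisk\tfull", "INFO\tok\tfine", "ERROR\tnet\tdown"]
lineage = [
    ("filter", lambda s: s.startswith("ERROR")),
    ("map",    lambda s: s.split("\t")[2]),
]

def evaluate(partition, lineage):
    """Replay the lineage over a base partition to (re)build the derived data."""
    out = list(partition)
    for op, f in lineage:
        if op == "filter":
            out = [x for x in out if f(x)]
        else:  # "map"
            out = [f(x) for x in out]
    return out

cached = evaluate(base, lineage)      # the in-memory copy of the derived RDD
cached = None                         # simulate a node failure losing the cache
recovered = evaluate(base, lineage)   # recompute from lineage instead of replicas
```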
  • 20. Spark Streaming - DStreams
      • An infinite stream of data, broken up into finite RDDs
      • Because Streaming uses the same abstraction (RDDs), it is possible to re-use code. This is pretty cool.
      • Not ideal for the lowest-latency applications; you need to be OK with >1s.
      • It is ideal for ML, and for problems that benefit from iteration.
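Micro-batching itself is easy to sketch in plain Python - the point being that the same batch-style word_count function is applied unchanged to each finite batch carved out of an unbounded stream (the names and batch size here are made up for the sketch):

```python
from collections import Counter
from itertools import islice

def word_count(lines):
    """Batch-style logic: flatMap each line to words, then count."""
    return Counter(w for line in lines for w in line.split())

def micro_batches(stream, batch_size):
    """Chop an unbounded iterator into finite batches (stand-in for 1s windows)."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

log_stream = iter(["a b", "b c", "c c", "a a"])   # pretend this never ends
counts_per_batch = [word_count(b) for b in micro_batches(log_stream, 2)]
```

Spark Streaming does the chopping by wall-clock window rather than by item count, but the code-reuse argument is the same: one function serves both batch and streaming paths.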
  • 21. Comparing Spark Streaming and Storm
      Spark Streaming:
      • Less mature
      • DStreams of RDDs
      • >1s latency
      • Micro-batched
      • Stateful, exactly-once - Spark handles failure automatically
      • Scala, Java (no Python)
      Storm:
      • More mature
      • Tuples
      • Lowest latency
      • Event-based (one tuple at a time)
      • Stateless (without Trident) - Storm alone does not handle failure automatically
      • Java, Clojure, Python, Ruby
      Spark Streaming and Storm Trident are more directly comparable.
  • 22. How to Choose One over the Other?
      Spark Streaming:
      • Buzzword compliant!
      • High-level, clean APIs
      • Code reuse across batch and streaming use cases
      Storm/Trident:
      • Production stable
      • More language options
      • Supports at-least-once, exactly-once, and at-most-once reliability
      Production in 3 weeks? Storm. Production in 3-6 months? Spark?
  • 23. Why Are People Loving on Spark?
      • The REPL makes for really easy data exploration.
      • Batch and streaming, using the same API.
      • Spark APIs are not difficult to pick up.
      • It’s possible to get a lot done in a lot less code than MapReduce.
      • Spark leverages existing technologies for data ingest and resource management, making it easy to adopt alongside Hadoop.
  • 24. What are the Issues?
      • Scaling. Anecdotally, small clusters with larger machines do better than large clusters with smaller machines (though Yahoo has 1200 nodes).
      • Failure modes exist where data can be lost. Use a highly reliable storage backend so you can replay things; Kafka is a good idea. This is documented.
      • Lots of this comes down to Spark’s relative immaturity, and that will change.
  • 25. Some Examples
  • 26. Word Count
      • Java MapReduce (~15 lines of code)
      • Java Spark (~7 lines of code):

        JavaSparkContext sc = new JavaSparkContext(master, appName, [sparkHome], [jars]);
        JavaRDD<String> file = sc.textFile("hdfs://...");
        JavaPairRDD<String, Integer> counts = file
            .flatMap(line -> Arrays.asList(line.split(" ")))
            .mapToPair(w -> new Tuple2<>(w, 1))
            .reduceByKey((x, y) -> x + y);
        counts.saveAsTextFile("hdfs://...");

      • Scala and Python (4 lines of code):

        val sc = new SparkContext(master, appName, [sparkHome], [jars])
        val file = sc.textFile("hdfs://...")
        val counts = file.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs://...")

        – Interactive shell: skip line 1 and replace the last line with counts.collect()
      • Java 8 (4 lines of code)
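The same flatMap → map → reduceByKey data flow, written out in plain Python for readers who want to trace it without a Spark installation (a sketch of the shape of the computation, not of Spark itself):

```python
from collections import defaultdict

lines = ["to be or", "not to be"]

# flatMap: one line -> many words
words = [w for line in lines for w in line.split(" ")]
# map: word -> (word, 1)
pairs = [(w, 1) for w in words]
# reduceByKey: sum the 1s per word
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n
```

Spark performs exactly these three steps, but partitioned across machines, with the reduceByKey stage doing the shuffle.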
  • 27. Spark Streaming - Network Word Count

      // Create the context with a 1 second batch size
      val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1),
        System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass))

      // Create a NetworkInputDStream on target host:port and count the
      // words in the input stream of \n delimited text (e.g. generated by 'nc')
      val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER)
      val words = lines.flatMap(_.split(" "))
      val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
      wordCounts.print()
      ssc.start()
      ssc.awaitTermination() // keep the driver alive until the stream is stopped
  • 28. Thanks!
      @mapr | maprtech | vgonzalez@mapr.com | MapR | maprtech | mapr-technologies

Editor's Notes

  1. Who has Spark in production? Who has a POC? Who has tried using the REPL? Who has never heard of Spark before today?
  2. I’d like to tell you a little about why I think Spark is cool, and why you should consider applying it to your data problems. I’ll talk about the world of MapReduce - AKA, before Spark. I’ll talk about what Spark is and why it’s cool, and we’ll talk a little about the comparisons between Spark and Storm. Finally, we’ll talk about the issues people report.
  3. MapReduce is conceptually simple, but it is difficult to program and is really only suitable for batch-style work. Today, we want a more interactive, iterative experience for processing big data. MapReduce involves mappers, an automatic shuffle phase, and a reduce phase. As we move through these stages, intermediate data is produced, and it needs to be stored somewhere for subsequent phases and tasks. This means that you’ll often read and write the same data multiple times in MapReduce. Later, DSLs like Hive and Pig came along to make processing easier, compiling a simpler language down to MapReduce. So with the DSLs, we gain efficiency in development of the processing pipelines, but MapReduce remains. DSLs eliminate boilerplate and can do some optimization, but do not get around the multiple stages of MapReduce, which necessarily involve extra I/O.
  4. Spark was developed at UC Berkeley in recent years, and went open source in 2010. To me, Spark represents a real refinement over Hadoop, that brings a long list of benefits you’ll want as a part of your life.
  5. In about 4 years Spark has seen tremendous adoption. 330 contributors, 50 companies, 10 platforms distributing Spark (as of 10/31). About a year ago, Databricks was formed to support Spark commercially. Yahoo, Adobe, Ooyala, Sharethrough are in production with Spark, and all the logos you see here (and probably more) have devoted some effort to supporting and advancing the Spark community. http://spark-summit.org/wp-content/uploads/2014/07/Productionizing-a-247-Spark-Streaming-Service-on-YARN-Ooyala.pdf
  6. Spark aims to be a unified platform for data processing, handling both batch and streaming use cases. The general purpose execution engine plugs into Mesos and YARN for resource management, and can also use its own standalone clustering capability. We can write programs for Spark in Scala, python and Java, and there are APIs included that allow us to: (Spark SQL) write SQL queries in our code, and get back RDDs (Streaming) process a never-ending stream of data at low latency (MLLib) use a machine learning library for batch and iterative use cases (GraphX) use a graph computation library to solve graph problems Data sources include HDFS, Amazon S3, Kafka and of course your local file system - chances are if you’ve got a bunch of data, it’s going to be pretty easy for Spark to get at it. All of this is supported by a really user-friendly API and very good and constantly improving documentation. Let’s talk about some of the components that make up the Spark platform… === Resources: Intro to Scala: http://www.artima.com/scalazine/articles/steps.html
  7. Spark aims to generalize MapReduce and provides a relatively small API. It uses memory and pipelining to achieve dramatically better performance than MapReduce. Spark appears poised to overtake MapReduce as the dominant big data execution engine.
  8. SQL Queries over your data, plus a mode of operation that acts as a “faster Hive”. Spark SQL can use your hive metastore, and query your hive tables using the Spark engine, which should be faster than Hive in many (but not all) cases. JDBC available today.
  9. Spark Streaming - operate on a stream of data. Data can come from a distributed file system such as HDFS, a messaging system such as Kafka, or even a network socket. The framework turns the stream into a series of RDDs, or micro-batches based on a window of time, which can then be operated on using the same API as is used for batch.
  10. MLlib is a machine learning library that aims to be fast, and more correct through iteration. My data scientist friends tell me that one way to get better results is to use a simple algorithm over more data. Spark and MLlib are terrific at iteration, so it’s a natural fit for the application of algorithms that benefit from iteration. MLlib isn’t just for Ph.D.s though. It’s also got summary statistics that those of us with a high-school level understanding of statistics can use on our data.
  11. Graph Basics: http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=graphsDataStrucs1 Advanced Training on GraphX - https://www.youtube.com/watch?v=Rb2oBG039dM Taking Wikipedia as an example, you could: Compute page rank using the link graph - the connections between Wikipedia pages - to generate a top-N pages report Analyze the connections between terms and the documents they’re used in Perform analysis of Wikipedia editors, to gain some understanding of the Wikipedia editor community
  12. These are the headlines - Spark is easy to develop in, and it’s FAST. Let’s unroll some of this.
  13. The REPL is a shell for Spark. You can write scala or python code, and get immediate feedback. You can explore your data, test out ideas extremely quickly.
  14. Even in Java (8), you can express programs with minimal cruft.
  15. RDDs are resilient distributed datasets, and they are a fundamental component of Spark. You can create RDDs in two ways - by parallelizing an existing data set (for example, generating an array then telling Spark to distribute it to workers) or by obtaining data from some external storage system. Once created, you can transform the data. However, when you make transformations, you are not actually modifying the data in place, because RDDs are immutable - transformations always create new RDDs. Transformations are your instructions on how to modify an RDD. Spark takes your transformations and creates a graph of operations to carry out against the data. Nothing actually happens with your data until you perform an action, which forces Spark to evaluate and execute the graph in order to present you some result. As in the MapReduce DSLs, this allows for a “compile” stage of execution that enables optimization - pipelining, for example. Thus, even though your Spark program may spawn many stages, Spark can avoid the intermediate I/O that is one of the reasons MapReduce is considered “slow”. Check the documentation for a complete list http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
  16. By default, Spark will recompute your RDDs each time you call an action. You can however have Spark persist your RDDs according to your preference. Among your options is to cache the RDDs, using memory. Caching an RDD in memory means you get to avoid recomputing the data from the transformation graph. This can obviously have a dramatic effect on performance. You can also persist RDDs to disk, or use a combination of the two. Spark will automatically spill to disk if you request caching, but lack the RAM to hold the working set.
  17. If you’re storing stuff in memory, at some point you’ll become concerned about how Spark will react to the loss of a node. The R in RDD stands for resilient, and the way Spark achieves resilience is by tracking data lineage. What this means is that Spark will maintain a graph of transformations from which it can recompute lost data. In the example here, we’ve got a simple scenario in which we’ve loaded a partition of data from HDFS, filtered it, yielding a filtered RDD, then we’ve applied a function to each of the elements of the RDDs, yielding a mapped RDD. If you cache the RDD “msgs”, and the node fails, Spark will recompute the RDD using the lineage information.
  19. The lineage information is stored in a directed acyclic graph, or DAG.
  20. http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming http://yahoohadoop.tumblr.com/post/98213421641/storm-and-spark-at-yahoo-why-chose-one-over-the-other http://realtime-cachedmind.tumblr.com/post/89974796387/real-time-processing-storm-trident-vs-spark-streaming
  21. https://www.youtube.com/watch?v=FjhRkfAuU7I
  22. http://spark-summit.org/wp-content/uploads/2014/07/Productionizing-a-247-Spark-Streaming-Service-on-YARN-Ooyala.pdf
  23. http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/