Apache Spark, an Introduction
Jonathan Lacefield – Solution Architect
DataStax
Disclaimer
The contents of this presentation represent my
personal views and do not reflect or represent
any views of my employer.
This is my take on Spark.
This is not DataStax’s take on Spark.
Notes
• Meetup Sponsor:
– Data Exchange Platform
– Core Software Engineering – Equifax
• Announcement:
– Data Exchange Platform is currently hiring to build the
next generation data platform. We are looking for
people with experience in one or more of the
following skills: Spark, Storm, Kafka, Samza, Hadoop,
Cassandra
– How to apply?
– Email aravind.yarram@equifax.com
Introduction
• Jonathan Lacefield
– Solutions Architect, DataStax
– Former Dev, DBA, Architect, reformed PM
– Email: jlacefie@gmail.com
– Twitter: @jlacefie
– LinkedIn: www.linkedin.com/in/jlacefield
This deck represents my own views and not the
views of my employer
DataStax Introduction
DataStax delivers Apache Cassandra in a database platform
purpose built for the performance and availability demands of
IOT, web, and mobile applications, giving enterprises a secure
always-on database that remains operationally simple when
scaled in a single datacenter or across multiple datacenters
and clouds.
Includes
1. Apache Cassandra
2. Apache Spark
3. Apache Solr
4. Apache Hadoop
5. Graph Coming Soon
DataStax, What we Do (Use Cases)
• Fraud Detection
• Personalization
• Internet of Things
• Messaging
• Lists of Things (Products, Playlists, etc)
• Smaller set of other things too!
We are all about working with temporal data sets at
large volumes with high transaction counts
(velocity).
Agenda
• Set Baseline (Pre Distributed Days and
Hadoop)
• Spark Conceptual Introduction
• Spark Key Concepts (Core)
• Spark Look at Each Module
– Spark SQL
– MLlib
– Spark Streaming
– GraphX
In the Beginning….
OLTP
Web Application Tier
OLAP
Statistical/Analytical Applications
ETL
Data Requirements Broke the Architecture
Along Came Hadoop with ….
MapReduce
Lifecycle of a MapReduce Job
But…. (along came Apache Spark)
• Started in 2009 in Berkeley’s AMP Lab
• Open sourced in 2010
• Commercial Provider is Databricks – http://databricks.com
• Solves 2 Big Hadoop Pain Points
Speed - In Memory and Fault Tolerant
Ease of Use – API of operations and datasets
Use Cases for Apache Spark
• Data ETL
• Interactive dashboard creation for customers
• Streaming (e.g., fraud detection, real-time
video optimization)
• “Complex analytics” (e.g., anomaly detection,
trend analysis)
Key Concepts - Core
• Resilient Distributed Datasets (RDDs) – Spark’s datasets
• Spark Context – Provides information on the Spark environment
and the application
• Transformations - Transforms data
• Actions - Triggers actual processing
• Directed Acyclic Graph (DAG) – Spark’s execution algorithm
• Broadcast Variables – Read-only variables cached on Workers
• Accumulators – Variables that can be added to with an
associated function on Workers
• Driver - “Main” application container for Spark Execution
• Executors – Execute tasks on data
• Resource Manager – Manages task assignment and status
• Worker – Execute and Cache
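A minimal Scala sketch tying several of these concepts together (the file name and values are illustrative, not from the deck):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("ConceptsDemo").setMaster("local[2]")
val sc = new SparkContext(conf)                 // Spark Context: connection to the Spark environment

val stopWords  = sc.broadcast(Set("a", "the"))  // Broadcast Variable: read-only copy shipped to Workers
val emptyLines = sc.accumulator(0)              // Accumulator: add-only counter updated by tasks

val lines = sc.textFile("data.txt")             // RDD: Spark's dataset abstraction
val words = lines.flatMap { line =>             // Transformation: lazy, only extends the DAG
  if (line.trim.isEmpty) emptyLines += 1
  line.split(" ").filterNot(stopWords.value.contains)
}
println(words.count())                          // Action: the Driver asks Executors to run tasks
println(emptyLines.value)                       // Accumulator value read back on the Driver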
Resilient Distributed Datasets (RDDs)
• Fault tolerant collection of elements that enable
parallel processing
• Spark’s Main Abstraction
• Transformations and Actions are executed against
RDDs
• Can persist in Memory, on Disk, or both
• Can be partitioned to control parallel processing
• Can be reused
– HUGE Efficiencies with processing
RDDs - Resilient
Source – databricks.com
(diagram: HDFS File → filter(func = someFilter(…)) → Filtered RDD → map(func = someAction(...)) → Mapped RDD)
RDDs track lineage information that can be used to
efficiently recompute lost data
RDDs - Distributed
Image Source - http://1.bp.blogspot.com/-jjuVIANEf9Y/Ur3vtjcIdgI/AAAAAAAABC0/-Ou9nANPeTs/s1600/p1.pn
RDDs – From the API
val someRdd = sc.textFile(someURL)
• Create an RDD from a text file
val lines = sc.parallelize(List("pandas", "i like pandas"))
• Create an RDD from a list of elements
• Can create RDDs from many different sources
• RDDs can, and should, be persisted in most cases
– lines.persist() or lines.cache()
• See here for more info
– http://spark.apache.org/docs/1.2.0/programming-guide.html
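To illustrate the reuse point above, a minimal sketch (file name is illustrative): persisting an RDD lets several actions share one computation.

val lines  = sc.textFile("log.txt")
val errors = lines.filter(_.contains("error"))
errors.cache()                                        // same as persist() with the default storage level
println(errors.count())                               // first action computes and caches errors
println(errors.filter(_.contains("timeout")).count()) // second action reuses the cached partitions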
Transformations
• Create one RDD and transform the contents into another RDD
• Examples
– Map
– Filter
– Union
– Distinct
– Join
• Complete list -
http://spark.apache.org/docs/1.2.0/programming-guide.html
• Lazy execution
– Transformations aren’t applied to an RDD until an Action is executed
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
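The same idea in Scala, chaining a few of the transformations listed above; nothing executes yet because transformations are lazy (file name is illustrative):

val inputRDD  = sc.textFile("log.txt")
val errorsRDD = inputRDD.filter(_.contains("error"))
val warnsRDD  = inputRDD.filter(_.contains("warning"))
val badRDD    = errorsRDD.union(warnsRDD)   // combine two RDDs
                         .distinct()        // drop duplicate lines
// Only lineage (the DAG) has been built so far - no data has been read or processed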
Actions
• Cause data to be returned to driver or saved to output
• Cause data retrieval and execution of all
Transformations on RDDs
• Common Actions
– Reduce
– Collect
– Take
– SaveAs….
• Complete list - http://spark.apache.org/docs/1.2.0/programming-
guide.html
• errorsRDD.take(1)
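Continuing the sketch from the Transformations slide, a few actions that would force execution:

val total    = badRDD.count()                      // number of error/warning lines
val firstFew = badRDD.take(5)                      // ship 5 elements back to the Driver
val chars    = badRDD.map(_.length).reduce(_ + _)  // aggregate a value across the cluster
badRDD.saveAsTextFile("bad-lines")                 // write the RDD out (one part file per partition)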
Example App
import sys
from pyspark import SparkContext
if __name__ == "__main__":
sc = SparkContext( "local", "WordCount",
sys.argv[0], None)
lines = sc.textFile(sys.argv[1])
counts = lines.flatMap(lambda s: s.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda x, y: x + y)
counts.saveAsTextFile(sys.argv[2])
Based on source from – databricks.com
Conceptual Representation
RDD
RDD
RDD
RDD
Transformations
Action Value
counts = lines.flatMap(lambda s: s.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda x, y: x + y)
counts.saveAsTextFile(sys.argv[2])
lines = sc.textFile(sys.argv[1])
Based on source from – databricks.com
Spark Execution
Image Source – Learning Spark http://shop.oreilly.com/product/0636920028512.do
Demo
Via the REPL
Spark SQL
Abstraction of the Spark API to support SQL-like interaction
(diagram: Spark SQL / HiveQL query → Parse → Analyze → LogicalPlan → Optimize → PhysicalPlan → Execute, via Catalyst and SQL Core)
• Programming Guide - https://spark.apache.org/docs/1.2.0/sql-programming-guide.html
• Used for code source in examples
• Catalyst - http://spark-summit.org/talk/armbrust-catalyst-a-query-optimization-framework-for-spark-and-shark/
SQLContext and SchemaRDD
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD
SchemaRDD can be created
1) Using reflection to infer schema structure from an existing RDD
2) Programmatic interface to create a schema and apply it to an RDD
SchemaRDD Creation - Reflection
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD
// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p =>
Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
SchemaRDD Creation - Explicit
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Create an RDD
val people = sc.textFile("examples/src/main/resources/people.txt")
// The schema is encoded in a string
val schemaString = "name age"
// Import Spark SQL data types and Row.
import org.apache.spark.sql._
// Generate the schema based on the string of schema
val schema =
StructType(
schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
// Apply the schema to the RDD.
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
// Register the SchemaRDD as a table.
peopleSchemaRDD.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.
val results = sqlContext.sql("SELECT name FROM people")
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
results.map(t => "Name: " + t(0)).collect().foreach(println)
Data Frames
• DataFrames will replace SchemaRDD (starting in Spark 1.3)
• https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
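A minimal sketch of what the reflection example above looks like with the DataFrame API introduced in Spark 1.3 (so not part of the 1.2 release the rest of this deck targets):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._                      // brings in .toDF(), replacing createSchemaRDD

case class Person(name: String, age: Int)
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
  .toDF()                                          // a DataFrame instead of a SchemaRDD

people.filter(people("age") >= 13 && people("age") <= 19)
  .select("name")
  .show()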
Demo
• SparkSQL via the REPL
Once a Schema Exists on an RDD
It’s either Spark SQL or HiveQL
Can use Thrift ODBC/JDBC for Remote Execution
MLlib
• Scalable, distributed, Machine Learning library
• Basic Statistics - summary statistics, correlations, stratified sampling, hypothesis
testing, random data generation
• Classification and Regression - linear models (SVMs, logistic regression, linear
regression), naive Bayes, decision trees, ensembles of trees (Random Forests and
Gradient-Boosted Trees)
• Clustering – k-means
• Collaborative Filtering - alternating least squares (ALS)
• Dimensionality Reduction - singular value decomposition (SVD), principal
component analysis (PCA)
• Optimization Primitives - stochastic gradient descent, limited-memory BFGS (L-
BFGS)
• In 1.2, spark.ml has been introduced in alpha form
– Provides a more uniform API
• Programming guide - https://spark.apache.org/docs/1.2.0/mllib-guide.html
Dependencies
• Linear Algebra package – Breeze
• For Python integration you must use NumPy
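A small MLlib sketch using k-means, one of the algorithms listed above (the input file of space-separated numeric features is illustrative):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()                                        // iterative algorithm, so keep the data in memory

val model = KMeans.train(points, 3, 20)           // k = 3 clusters, 20 iterations
println("WSSSE = " + model.computeCost(points))   // within-set sum of squared errors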
Spark Streaming
From a DataStax Presentation by Rustam Aliyev
https://academy.datastax.com
@rstml
https://github.com/rstml/datastax-spark-streaming-demo
1. Main Concepts
(diagram: incoming Messages are grouped into Blocks, and Blocks are grouped into µBatches)
Block (created every block interval, default 200ms):
• Partitioning of data
• Impacts parallelism
• Default 200ms
• Min recommended 50ms
sparkConf.set("spark.streaming.blockInterval", "200")
µBatch (created every batch interval, e.g. 1s):
• Essentially an RDD
• Sequence forms a Discretized Stream – DStream
• Operations on a DStream translate to RDDs
new StreamingContext(sparkCtx, Seconds(1))
Initializing Streaming Context
import org.apache.spark._
import org.apache.spark.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
2. Stream Sources
(diagram: Message Source → Receiver → Blocks → µBatch → DStream)
Stream Sources (Receivers)
1. Basic Sources
• fileStream / textFileStream
• actorStream (AKKA)
• queueStream (Queue[RDD])
• rawSocketStream
• socketStream / socketTextStream
2. Advanced Sources
• Kafka
• Twitter
• ZeroMQ
• MQTT
• Flume
• AWS Kinesis
3. Custom
Initializing Socket Stream
import org.apache.spark._
import org.apache.spark.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val text = ssc.socketTextStream("localhost", 9191)
Initializing Twitter Stream
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val tweets = TwitterUtils.createStream(ssc, auth)
Custom Receiver (WebSocket)
import org.apache.spark._
import org.apache.spark.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val rsvp = ssc.receiverStream(new
WebSocketReceiver("ws://stream.meetup.com/2/rsvps"))
import org.apache.spark.streaming.receiver.Receiver
import org.apache.spark.storage.StorageLevel
class WebSocketReceiver(url: String)
extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2)
{
// ...
}
3. Transformations
DStream Transformations
Single Stream
map
flatMap
filter
repartition
count
countByValue
reduce
reduceByKey
transform
updateStateByKey
Multiple Streams
union
join
leftOuterJoin
rightOuterJoin
cogroup
transformWith
Single Stream Transformation
* Digits.count() – applies count to each 1s µBatch of the Digits stream (diagram)
Multiple Streams Transformation
* Chars.union(Digits) – merges the corresponding 1s µBatches of the Chars and Digits streams (diagram)
Word Count
import org.apache.spark._
import org.apache.spark.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val text = ssc.socketTextStream("localhost", 9191)
val words = text.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1))
.reduceByKey(_ + _)
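One of the transformations listed earlier, updateStateByKey, keeps running state across batches. A minimal sketch building on the word count above (the checkpoint directory name is illustrative; stateful operations require checkpointing):

ssc.checkpoint("running-wordcount.cp")

// new per-batch counts for a word + previous total => new total
val updateTotals: (Seq[Int], Option[Int]) => Option[Int] =
  (newCounts, state) => Some(newCounts.sum + state.getOrElse(0))

val runningCounts = words.map(x => (x, 1)).updateStateByKey(updateTotals)
runningCounts.print()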
4. Window Operations
Window Operations
• Transformations over a sliding window of data
1. Window Length – duration of the window
2. Sliding Interval – interval at which operation performed
(diagram: Window Length = 60s, Sliding Interval = 10s, over a stream of 5s batches)
Window Operations
Window based transformations:
window
countByWindow
countByValueAndWindow
reduceByWindow
reduceByKeyAndWindow
groupByKeyAndWindow
Word Count by Window
import org.apache.spark._
import org.apache.spark.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val text = ssc.socketTextStream("localhost", 9191)
val words = text.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1))
.reduceByKeyAndWindow((a:Int,b:Int) => a+b, Seconds(60),
Seconds(10))
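A more efficient variant of the same window (my addition, not shown in the deck): supplying an inverse function lets Spark subtract the batches that leave the window instead of recomputing the full 60 seconds. This form requires checkpointing.

ssc.checkpoint("windowed-wordcount.cp")     // illustrative directory name

val wordCounts = words.map(x => (x, 1))
  .reduceByKeyAndWindow(
    (a: Int, b: Int) => a + b,              // add batches entering the window
    (a: Int, b: Int) => a - b,              // subtract batches leaving the window
    Seconds(60), Seconds(10))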
Large Window Considerations
• Large windows:
1. Take longer to process
2. Require larger batch interval for stable processing
• Hour-scale windows are not recommended
• For multi-hour aggregations use real data stores (e.g., Cassandra)
• Spark Streaming is NOT designed to be a persistent data store
• Set spark.cleaner.ttl and spark.streaming.unpersist (be careful)
5. Output Operations
DStream Output Operations
Standard
print
saveAsTextFiles
saveAsObjectFiles
saveAsHadoopFiles
saveToCassandra* (DataStax Spark Cassandra Connector)
foreachRDD
persist
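A sketch of foreachRDD, the escape hatch in the list above: it hands each micro-batch to you as a plain RDD (assuming the wordCounts DStream from the earlier word count example):

wordCounts.foreachRDD { rdd =>
  // any RDD operation is allowed here; this one prints the top 5 words of the batch
  rdd.sortBy(_._2, ascending = false).take(5).foreach(println)
}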
Saving to Cassandra
import org.apache.spark._
import org.apache.spark.streaming._
import com.datastax.spark.connector.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val text = ssc.socketTextStream("localhost", 9191)
val words = text.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.saveToCassandra("keyspace", "table", SomeColumns("word", "total"))
Start Processing
import org.apache.spark._
import org.apache.spark.streaming._
import com.datastax.spark.connector.streaming._
// Spark connection options
val conf = new SparkConf().setAppName(appName).setMaster(master)
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val text = ssc.socketTextStream("localhost", 9191)
val words = text.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.saveToCassandra("keyspace", "table", SomeColumns("word", "total"))
ssc.start()
ssc.awaitTermination()
6. Scalability
Scaling Streaming
• How to scale stream processing?
(diagram: Kafka Producer → Spark Receiver → Spark Processor → Output)
Parallelism – Partitioning
• Partition input stream (e.g. by topics)
• Each receiver can be run on separate worker
(diagram: Kafka Topics 1…N, each feeding its own Spark Receiver → Spark Processor → Output)
Parallelism – Partitioning
• Partition stream (e.g. by topics)
• Use union() to create single DStream
• Transformations applied on the unified stream
val numStreams = 5
val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }
val unifiedStream = streamingContext.union(kafkaStreams)
unifiedStream.print()
Parallelism – RePartitioning
• Explicitly repartition input stream
• Distribute received batches across specified number of machines
(diagram: Twitter Producer → single Spark Receiver → batches repartitioned across multiple Spark Processors → Output)
Parallelism – RePartitioning
• Explicitly repartition input stream
• Distribute received batches across specified number of machines
• Use inputstream.repartition(N)
val numWorkers = 5
val twitterStream = TwitterUtils.createStream(...)
twitterStream.repartition(numWorkers)
Parallelism – Tasks
• Each block processed by separate task
• To increase parallel tasks, increase number of blocks in a batch
• Tasks per Receiver per Batch ≈ Batch Interval / Block Interval
• Example: 2s batch / 200ms block = 10 tasks
• CPU cores will not be utilized if number of tasks is too low
• Consider tuning default number of parallel tasks
spark.default.parallelism
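An illustrative tuning sketch for the arithmetic above (the concrete values are assumptions, not recommendations): 2s batches with 200ms blocks give roughly 10 tasks per receiver per batch.

import org.apache.spark._
import org.apache.spark.streaming._

val sparkConf = new SparkConf()
  .setAppName(appName).setMaster(master)
  .set("spark.streaming.blockInterval", "200")          // block interval in ms
  .set("spark.default.parallelism", "32")               // default number of parallel tasks
val ssc = new StreamingContext(sparkConf, Seconds(2))   // 2s batch interval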
7. Fault Tolerance
Fault Tolerance
To recover streaming operation, Spark needs:
1. RDD data
2. DAG/metadata of DStream
Fault Tolerance – RDD
• Recomputing a lost RDD may not be possible for a stream source (the original data may be gone)
• Protect data by replicating RDD
• RDD replication controlled by org.apache.spark.storage.StorageLevel
• Use storage level with _2 suffix (2 replicas):
– DISK_ONLY_2
– MEMORY_ONLY_2
– MEMORY_ONLY_SER_2
– MEMORY_AND_DISK_2
– MEMORY_AND_DISK_SER_2 (default for most receivers)
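A sketch of asking a basic receiver for a specific level (socketTextStream accepts a StorageLevel argument; the 2-replica in-memory choice is just an example), given a StreamingContext ssc as in the earlier examples:

import org.apache.spark.storage.StorageLevel

val text = ssc.socketTextStream("localhost", 9191, StorageLevel.MEMORY_ONLY_2)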
Fault Tolerance – Checkpointing
• Periodically writes:
1. DAG/metadata of DStream(s)
2. RDD data for some stateful transformations (updateStateByKey &
reduceByKeyAndWindow*)
• Uses fault-tolerant distributed file system for persistence.
• After failure, StreamingContext recreated from checkpoint data
on restart.
• Choose interval carefully as storage will impact processing times.
Fault Tolerance – Checkpointing
import org.apache.spark._
import org.apache.spark.streaming._
val checkpointDirectory = "words.cp" // Directory name for checkpoint data
def createContext(): StreamingContext = {
// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))
val text = ssc.socketTextStream("localhost", 9191)
ssc.checkpoint(checkpointDirectory) // set checkpoint directory
val words = text.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.saveToCassandra("keyspace", "table", SomeColumns("word", "total"))
ssc
}
val conf = new SparkConf().setAppName(appName).setMaster(master)
// Get StreamingContext from checkpoint data or create a new one
val ssc = StreamingContext.getOrCreate(checkpointDirectory, createContext _)
ssc.start()
ssc.awaitTermination()
Fault Tolerance – Checkpointing
• Verifying checkpoint data on CFS:
$ dse hadoop fs -ls words.cp
Found 11 items
drwxrwxrwx - rustam staff 0 2014-12-21 13:24 /user/rustam/words.cp/b8e8e262-2f8d-4e2f-ae28-f5cfbadb29bf
-rwxrwxrwx 1 rustam staff 3363 2014-12-21 13:25 /user/rustam/words.cp/checkpoint-1419168345000
-rwxrwxrwx 1 rustam staff 3368 2014-12-21 13:25 /user/rustam/words.cp/checkpoint-1419168345000.bk
-rwxrwxrwx 1 rustam staff 3393 2014-12-21 13:25 /user/rustam/words.cp/checkpoint-1419168350000
-rwxrwxrwx 1 rustam staff 3398 2014-12-21 13:25 /user/rustam/words.cp/checkpoint-1419168350000.bk
-rwxrwxrwx 1 rustam staff 3422 2014-12-21 13:25 /user/rustam/words.cp/checkpoint-1419168355000
-rwxrwxrwx 1 rustam staff 3427 2014-12-21 13:25 /user/rustam/words.cp/checkpoint-1419168355000.bk
-rwxrwxrwx 1 rustam staff 3447 2014-12-21 13:26 /user/rustam/words.cp/checkpoint-1419168360000
-rwxrwxrwx 1 rustam staff 3452 2014-12-21 13:26 /user/rustam/words.cp/checkpoint-1419168360000.bk
-rwxrwxrwx 1 rustam staff 3499 2014-12-21 13:26 /user/rustam/words.cp/checkpoint-1419168365000
-rwxrwxrwx 1 rustam staff 3504 2014-12-21 13:26 /user/rustam/words.cp/checkpoint-…
Failure Types
• Consider 2 failure scenarios:
(diagram: Producer → Receiver → RDD replica 1 and RDD replica 2 held on separate Processors → Output)
State of Data
1. Data received and replicated
• Will survive failure of 1 replica
2. Data received but only buffered for replication
• Not replicated yet
• Needs recomputation if lost
Receiver Reliability Types
1. Reliable Receivers
• Receiver acknowledges source only after ensuring that data replicated.
• Source needs to support message ack. E.g. Kafka, Flume.
2. Unreliable Receivers
• Data can be lost in case of failure.
• Source doesn’t support message ack. E.g. Twitter.
Fault Tolerance
• Spark 1.2 adds Write Ahead Log (WAL) support for Streaming
• Protection for Unreliable Receivers
• See SPARK-3129 for architecture details
State / Receiver Type    Received, Replicated    Received, Only Buffered
Reliable Receiver        Safe                    Safe
Unreliable Receiver      Safe                    Data Loss
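A sketch of turning the WAL on, assuming the Spark 1.2 configuration key spark.streaming.receiver.writeAheadLog.enable; checkpointing must also be configured, since the log is written under the checkpoint directory:

import org.apache.spark._
import org.apache.spark.streaming._

val conf = new SparkConf()
  .setAppName(appName).setMaster(master)
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("wal-demo.cp")   // illustrative directory; WAL files are stored beneath it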
GraphX
• Alpha release
• Provides Graph computation capabilities on
top of RDDs
• Resilient Distributed Property Graph: a
directed multigraph with properties attached
to each vertex and edge.
• The goal of the GraphX project is to unify
graph-parallel and data-parallel computation
in one system with a single composable API.
I am not a Graph-guy yet.
Who here is working with Graph today?
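A minimal property-graph sketch (the vertex and edge data are made up for illustration):

import org.apache.spark.graphx._

// (vertexId, property) pairs plus Edge(srcId, dstId, property) triples
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 1L, "follows")))

val graph = Graph(vertices, edges)
graph.inDegrees.collect().foreach(println)   // a data-parallel view over the graph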
Handy Tools
• Ooyala Spark Job Server -
https://github.com/ooyala/spark-jobserver
• Monitoring with Graphite and Grafana – http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
An Introduct to Spark - Atlanta Spark Meetup

More Related Content

What's hot

Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingGerger
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop GuideSimplilearn
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Akhil Das
 
Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand fordThu Hiền
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...Edureka!
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkDuyhai Doan
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsgagravarr
 
Apache MetaModel - unified access to all your data points
Apache MetaModel - unified access to all your data pointsApache MetaModel - unified access to all your data points
Apache MetaModel - unified access to all your data pointsKasper Sørensen
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra EnvironmentJim Hatcher
 
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, DatabricksSpark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, DatabricksGoDataDriven
 
Updates from Cassandra Summit 2016 & SASI Indexes
Updates from Cassandra Summit 2016 & SASI IndexesUpdates from Cassandra Summit 2016 & SASI Indexes
Updates from Cassandra Summit 2016 & SASI IndexesJim Hatcher
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionChetan Khatri
 

What's hot (20)

Spark sql
Spark sqlSpark sql
Spark sql
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)
 
Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand ford
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Spark core
Spark coreSpark core
Spark core
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needs
 
Apache MetaModel - unified access to all your data points
Apache MetaModel - unified access to all your data pointsApache MetaModel - unified access to all your data points
Apache MetaModel - unified access to all your data points
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
 
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, DatabricksSpark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
 
Updates from Cassandra Summit 2016 & SASI Indexes
Updates from Cassandra Summit 2016 & SASI IndexesUpdates from Cassandra Summit 2016 & SASI Indexes
Updates from Cassandra Summit 2016 & SASI Indexes
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 

Viewers also liked

Scala presentation by Aleksandar Prokopec
Scala presentation by Aleksandar ProkopecScala presentation by Aleksandar Prokopec
Scala presentation by Aleksandar ProkopecLoïc Descotte
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Evan Chan
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Sciencesarith divakar
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingDemi Ben-Ari
 
Pixie dust overview
Pixie dust overviewPixie dust overview
Pixie dust overviewDavid Taieb
 
Spark tutorial py con 2016 part 2
Spark tutorial py con 2016   part 2Spark tutorial py con 2016   part 2
Spark tutorial py con 2016 part 2David Taieb
 
Spark tutorial pycon 2016 part 1
Spark tutorial pycon 2016   part 1Spark tutorial pycon 2016   part 1
Spark tutorial pycon 2016 part 1David Taieb
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Krishna Sankar
 
A Step to programming with Apache Spark
A Step to programming with Apache SparkA Step to programming with Apache Spark
A Step to programming with Apache SparkKnoldus Inc.
 
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to sparkJavier Arrieta
 
Scaling out logistic regression with Spark
Scaling out logistic regression with SparkScaling out logistic regression with Spark
Scaling out logistic regression with SparkBarak Gitsis
 
Big data processing with apache spark
Big data processing with apache sparkBig data processing with apache spark
Big data processing with apache sparksarith divakar
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Meet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaMeet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaKnoldus Inc.
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Akka Finite State Machine
Akka Finite State MachineAkka Finite State Machine
Akka Finite State MachineKnoldus Inc.
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark StreamingGerard Maas
 

Viewers also liked (20)

Scala in practice
Scala in practiceScala in practice
Scala in practice
 
Scala presentation by Aleksandar Prokopec
Scala presentation by Aleksandar ProkopecScala presentation by Aleksandar Prokopec
Scala presentation by Aleksandar Prokopec
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computing
 
Pixie dust overview
Pixie dust overviewPixie dust overview
Pixie dust overview
 
Spark tutorial py con 2016 part 2
Spark tutorial py con 2016   part 2Spark tutorial py con 2016   part 2
Spark tutorial py con 2016 part 2
 
Spark tutorial pycon 2016 part 1
Spark tutorial pycon 2016   part 1Spark tutorial pycon 2016   part 1
Spark tutorial pycon 2016 part 1
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)
 
A Step to programming with Apache Spark
A Step to programming with Apache SparkA Step to programming with Apache Spark
A Step to programming with Apache Spark
 
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to spark
 
Scaling out logistic regression with Spark
Scaling out logistic regression with SparkScaling out logistic regression with Spark
Scaling out logistic regression with Spark
 
Big data processing with apache spark
Big data processing with apache sparkBig data processing with apache spark
Big data processing with apache spark
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Meet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaMeet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + Kafka
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Akka Finite State Machine
Akka Finite State MachineAkka Finite State Machine
Akka Finite State Machine
 
Essentialism
EssentialismEssentialism
Essentialism
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 

Similar to An Introduct to Spark - Atlanta Spark Meetup

Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
Learning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQLLearning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQLphanleson
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to SparkKyle Burke
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseMartin Toshev
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkIke Ellis
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache sparkMuktadiur Rahman
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache sparkJUGBD
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZDataFactZ
 

Similar to An Introduct to Spark - Atlanta Spark Meetup (20)

Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Spark core
Spark coreSpark core
Spark core
 
Learning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQLLearning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQL
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle Database
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
 

Recently uploaded

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Recently uploaded (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

An Introduct to Spark - Atlanta Spark Meetup

  • 1. Apache Spark, an Introduction Jonathan Lacefield – Solution Architect DataStax
  • 2. Disclaimer The contents of this presentation represent my personal views and do not reflect or represent any views of my employer. This is my take on Spark. This is not DataStax’s take on Spark.
  • 3. Notes • Meetup Sponsor: – Data Exchange Platform – Core Software Engineering – Equifax • Announcement: – Data Exchange Platform is currently hiring to build the next generation data platform. We are looking for people with experience in one or more of the following skills: Spark, Storm, Kafka, samza, Hadoop, Cassandra – How to apply? – Email aravind.yarram@equifax.com
  • 4. Introduction • Jonathan Lacefield – Solutions Architect, DataStax – Former Dev, DBA, Architect, reformed PM – Email: jlacefie@gmail.com – Twitter: @jlacefie – LinkedIn: www.linkedin.com/in/jlacefield This deck represents my own views and not the views of my employer
  • 5. DataStax Introduction DataStax delivers Apache Cassandra in a database platform purpose built for the performance and availability demands of IOT, web, and mobile applications, giving enterprises a secure always-on database that remains operationally simple when scaled in a single datacenter or across multiple datacenters and clouds. Includes 1. Apache Cassandra 2. Apache Spark 3. Apache SOLR 4. Apache Hadoop 5. Graph Coming Soon
  • 6. DataStax, What we Do (Use Cases) • Fraud Detection • Personalization • Internet of Things • Messaging • Lists of Things (Products, Playlists, etc) • Smaller set of other things too! We are all about working with temporal data sets at large volumes with high transaction counts (velocity).
  • 7. Agenda • Set Baseline (Pre Distributed Days and Hadoop) • Spark Conceptual Introduction • Spark Key Concepts (Core) • Spark Look at Each Module – Spark SQL – MLIB – Spark Streaming – GraphX
  • 8. In the Beginning…. OLTP Web Application Tier OLAP Statistical/Analytical Applications ETL
  • 10. Along Came Hadoop with ….
  • 12. Lifecycle of a MapReduce Job
  • 14. • Started in 2009 in Berkley’s AMP Lab • Open Sources in 2010 • Commercial Provider is Databricks – http://databricks.com • Solve 2 Big Hadoop Pain Points Speed - In Memory and Fault Tolerant Ease of Use – API of operations and datasets
  • 15. Use Cases for Apache Spark • Data ETL • Interactive dashboard creation for customers • Streaming (e.g., fraud detection, real-time video optimization) • “Complex analytics” (e.g., anomaly detection, trend analysis)
  • 16. Key Concepts - Core • Resilient Distributed Datasets (RDDs) – Spark’s datasets • Spark Context – Provides information on the Spark environment and the application • Transformations - Transforms data • Actions - Triggers actual processing • Directed Acyclic Graph (DAG) – Spark’s execution algorithm • Broadcast Variables – Read only variables on Workers • Accumulators – Variables that can be added to with an associated function on Workers • Driver - “Main” application container for Spark Execution • Executors – Execute tasks on data • Resource Manager – Manages task assignment and status • Worker – Execute and Cache
  • 17. Resilient Distributed Datasets (RDDs) • Fault tolerant collection of elements that enable parallel processing • Spark’s Main Abstraction • Transformation and Actions are executed against RDDs • Can persist in Memory, on Disk, or both • Can be partitioned to control parallel processing • Can be reused – HUGE Efficiencies with processing
  • 18. RDDs - Resilient Source – databricks.com HDFS File Filtered RDD Mapped RDD filter (func = someFilter(…)) map (func = someAction(...)) RDDs track lineage information that can be used to efficiently recompute lost data
  • 19. RDDs - Distributed Image Source - http://1.bp.blogspot.com/-jjuVIANEf9Y/Ur3vtjcIdgI/AAAAAAAABC0/-Ou9nANPeTs/s1600/p1.pn
  • 20. RDDs – From the API val someRdd = sc.textFile(someURL) • Create an RDD from a text file val lines = sc.parallelize(List("pandas", "i like pandas")) • Create an RDD from a list of elements • Can create RDDs from many different sources • RDDs can, and should, be persisted in most cases – lines.persist() or lines.cache() • See here for more info – http://spark.apache.org/docs/1.2.0/programming-guide.html
  • 21. Transformations • Create one RDD and transform the contents into another RDD • Examples – Map – Filter – Union – Distinct – Join • Complete list - http://spark.apache.org/docs/1.2.0/programming-guide.html • Lazy execution – Transformations aren’t applied to an RDD until an Action is executed inputRDD = sc.textFile("log.txt") errorsRDD = inputRDD.filter(lambda x: "error" in x)
  • 22. Actions • Cause data to be returned to driver or saved to output • Cause data retrieval and execution of all Transformations on RDDs • Common Actions – Reduce – Collect – Take – SaveAs…. • Complete list - http://spark.apache.org/docs/1.2.0/programming- guide.html • errorsRDD.take(1)
  • 23. Example App import sys from pyspark import SparkContext if __name__ == "__main__": sc = SparkContext( “local”, “WordCount”, sys.argv[0], None) lines = sc.textFile(sys.argv[1]) counts = lines.flatMap(lambda s: s.split(“ ”)) .map(lambda word: (word, 1)) .reduceByKey(lambda x, y: x + y) counts.saveAsTextFile(sys.argv[2]) Based on source from – databricks.com 1 2 3
  • 24. Conceptual Representation RDD RDD RDD RDD Transformations Action Value counts = lines.flatMap(lambda s: s.split(“ ”)) .map(lambda word: (word, 1)) .reduceByKey(lambda x, y: x + y) counts.saveAsTextFile(sys.argv[2]) lines = sc.textFile(sys.argv[1]) Based on source from – databricks.com 1 2 3
  • 25. Spark Execution Image Source – Learning Spark http://shop.oreilly.com/product/0636920028512.do
  • 27. Spark SQL Abstraction of Spark API to support SQL like interaction Parse Analyze LogicalPlan Optimize Spark SQL HiveQL PhysicalPlan Execute Catalyst SQL Core • Programming Guide - https://spark.apache.org/docs/1.2.0/sql-programming-guide.html • Used for code source in examples • Catalyst - http://spark-summit.org/talk/armbrust-catalyst-a-query-optimization-framework-for-spark-and-shark/
  • 28. SQLContext and SchemaRDD val sc: SparkContext // An existing SparkContext. val sqlContext = new org.apache.spark.sql.SQLContext(sc) // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD. import sqlContext.createSchemaRDD A SchemaRDD can be created 1) Using reflection to infer the schema structure from an existing RDD 2) Using a programmatic interface to create a schema and apply it to an RDD
  • 29. SchemaRDD Creation - Reflection // sc is an existing SparkContext. val sqlContext = new org.apache.spark.sql.SQLContext(sc) // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD. import sqlContext.createSchemaRDD // Define the schema using a case class. // Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit, // you can use custom classes that implement the Product interface. case class Person(name: String, age: Int) // Create an RDD of Person objects and register it as a table. val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)) people.registerTempTable("people") // SQL statements can be run by using the sql methods provided by sqlContext. val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19") // The results of SQL queries are SchemaRDDs and support all the normal RDD operations. // The columns of a row in the result can be accessed by ordinal. teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
  • 30. SchemaRDD Creation - Explicit // sc is an existing SparkContext. val sqlContext = new org.apache.spark.sql.SQLContext(sc) // Create an RDD val people = sc.textFile("examples/src/main/resources/people.txt") // The schema is encoded in a string val schemaString = "name age" // Import Spark SQL data types and Row. import org.apache.spark.sql._ // Generate the schema based on the string of schema val schema = StructType( schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true))) // Convert records of the RDD (people) to Rows. val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim)) // Apply the schema to the RDD. val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema) // Register the SchemaRDD as a table. peopleSchemaRDD.registerTempTable("people") // SQL statements can be run by using the sql methods provided by sqlContext. val results = sqlContext.sql("SELECT name FROM people") // The results of SQL queries are SchemaRDDs and support all the normal RDD operations. // The columns of a row in the result can be accessed by ordinal. results.map(t => "Name: " + t(0)).collect().foreach(println)
  • 31. Data Frames • Data Frames will replace SchemaRDD • https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
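  A hedged sketch of the DataFrame API that takes over from SchemaRDD in Spark 1.3 (the people.json sample file ships with the Spark distribution; check the 1.3 docs for the exact methods):

  val df = sqlContext.jsonFile("examples/src/main/resources/people.json")

  // Same teenager query as the SchemaRDD examples, expressed with the DataFrame DSL
  df.filter(df("age") >= 13 && df("age") <= 19)
    .select("name")
    .show()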
  • 33. Once a Schema Exists on an RDD • Query it with either Spark SQL or HiveQL • Can use Thrift ODBC/JDBC for Remote Execution
  • 34. MLlib • Scalable, distributed, Machine Learning library • Basic Statistics - summary statistics, correlations, stratified sampling, hypothesis testing, random data generation • Classification and Regression - linear models (SVMs, logistic regression, linear regression), naive Bayes, decision trees, ensembles of trees (Random Forests and Gradient-Boosted Trees) • Clustering – k-means • Collaborative Filtering - alternating least squares (ALS) • Dimensionality Reduction - singular value decomposition (SVD), principal component analysis (PCA) • Optimization Primitives - stochastic gradient descent, limited-memory BFGS (L-BFGS) • In 1.2, spark.ml has been introduced in Alpha form – Provides more uniformity across the API • Programming guide - https://spark.apache.org/docs/1.2.0/mllib-guide.html
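  A minimal MLlib sketch of the k-means clustering listed above (the input file is assumed to contain space-separated numeric features, as in the kmeans_data.txt sample shipped with Spark):

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  val data   = sc.textFile("data/mllib/kmeans_data.txt")
  val parsed = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()

  val model = KMeans.train(parsed, 2, 20)   // k = 2 clusters, 20 iterations

  // Lower cost = tighter clusters
  println("Within set sum of squared errors = " + model.computeCost(parsed))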
  • 35. Dependencies • Linear Algebra package – Breeze • For Python integration you must use NumPy
  • 36. Spark Streaming From a DataStax Presentation by Rustam Aliyev https://academy.datastax.com @rstml
  • 39. [Diagram: incoming messages are grouped into Blocks every 200ms, and Blocks are grouped into a µBatch every 1s]
  • 40. [Diagram: messages → Blocks (200ms) → µBatch (1s)] • Blocks partition the data within a batch • Impacts parallelism • Default block interval 200ms • Min recommended 50ms • A µBatch is essentially an RDD • The sequence of µBatches forms a Discretized Stream – DStream • Operations on a DStream translate to operations on RDDs
  • 41. [Diagram: 200ms block interval within a 1s µBatch] sparkConf.set("spark.streaming.blockInterval", "200") new StreamingContext(sparkCtx, Seconds(1))
  • 42. Initializing Streaming Context import org.apache.spark._ import org.apache.spark.streaming._ // Spark connection options val conf = new SparkConf().setAppName(appName).setMaster(master) // streaming with 1 second batch window val ssc = new StreamingContext(conf, Seconds(1))
  • 44. [Diagram: Message Source → Receiver → Blocks → µBatch → DStream]
  • 45. Stream Sources (Receivers) 1. Basic Sources • fileStream / textFileStream • actorStream (AKKA) • queueStream (Queue[RDD]) • rawSocketStream • socketStream / socketTextStream 2. Advanced Sources • Kafka • Twitter • ZeroMQ • MQTT • Flume • AWS Kinesis 3. Custom
  • 46. Initializing Socket Stream import org.apache.spark._ import org.apache.spark.streaming._ // Spark connection options val conf = new SparkConf().setAppName(appName).setMaster(master) // streaming with 1 second batch window val ssc = new StreamingContext(conf, Seconds(1)) val text = ssc.socketTextStream("localhost", "9191")
  • 47. Initializing Twitter Stream import org.apache.spark._ import org.apache.spark.streaming._ import org.apache.spark.streaming.twitter._ // Spark connection options val conf = new SparkConf().setAppName(appName).setMaster(master) // streaming with 1 second batch window val ssc = new StreamingContext(conf, Seconds(1)) val tweets = TwitterUtils.createStream(ssc, auth)
  • 48. Custom Receiver (WebSocket) import org.apache.spark._ import org.apache.spark.streaming._ // Spark connection options val conf = new SparkConf().setAppName(appName).setMaster(master) // streaming with 1 second batch window val ssc = new StreamingContext(conf, Seconds(1)) val rsvp = ssc.receiverStream(new WebSocketReceiver("ws://stream.meetup.com/2/rsvps")) import org.apache.spark.streaming.receiver.Receiver class WebSocketReceiver(url: String) extends Receiver[String](storageLevel) { // ... }
  • 51. Single Stream Transformation • Digits.count() counts the elements of each 1s µBatch of the Digits stream, producing a new DStream of per-batch counts [Diagram]
  • 52. Multiple Streams Transformation • Chars.union(Digits) merges the two streams batch by batch into a single DStream [Diagram]
  • 53. Word Count import org.apache.spark._ import org.apache.spark.streaming._ // Spark connection options val conf = new SparkConf().setAppName(appName).setMaster(master) // streaming with 1 second batch window val ssc = new StreamingContext(conf, Seconds(1)) val text = ssc.socketTextStream("localhost", "9191") val words = text.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)) .reduceByKey(_ + _)
  • 55–57. Window Operations • Transformations over a sliding window of data 1. Window Length – duration of the window (60s in the diagrams) 2. Sliding Interval – interval at which the operation is performed (10s in the diagrams) [Diagram: a 60s window over 5s batches, sliding forward 10s at a time]
  • 58. Window Operations Window based transformations: window countByWindow countByValueAndWindow reduceByWindow reduceByKeyAndWindow groupByKeyAndWindow
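  As a hedged illustration, two of these applied to the words DStream from the earlier word-count example (window and slide durations are arbitrary but must be multiples of the batch interval; stateful window operations also need a checkpoint directory):

  ssc.checkpoint("checkpoint")   // required by countByWindow and friends

  // All words received in the last 60 seconds, re-evaluated every 10 seconds
  val lastMinute = words.window(Seconds(60), Seconds(10))

  // A single element count per 60-second window, sliding every 10 seconds
  val counts = words.countByWindow(Seconds(60), Seconds(10))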
  • 59. Word Count by Window import org.apache.spark._ import org.apache.spark.streaming._ // Spark connection options val conf = new SparkConf().setAppName(appName).setMaster(master) // streaming with 1 second batch window val ssc = new StreamingContext(conf, Seconds(1)) val text = ssc.socketTextStream("localhost", "9191") val words = text.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)) .reduceByKeyAndWindow((a:Int,b:Int) => a+b, Seconds(60), Seconds(10))
  • 60. Large Window Considerations • Large windows: 1. Take longer to process 2. Require larger batch interval for stable processing • Hour-scale windows are not recommended • For multi-hour aggregations use real data stores (e.g. Cassandra) • Spark Streaming is NOT designed to be a persistent data store • Set spark.cleaner.ttl and spark.streaming.unpersist (be careful)
  • 63. Saving to Cassandra import org.apache.spark._ import org.apache.spark.streaming._ import com.datastax.spark.connector.streaming._ // Spark connection options val conf = new SparkConf().setAppName(appName).setMaster(master) // streaming with 1 second batch window val ssc = new StreamingContext(conf, Seconds(1)) val text = ssc.socketTextStream("localhost", "9191") val words = text.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.saveToCassandra("keyspace", "table", SomeColumns("word", "total"))
  • 64. Start Processing import org.apache.spark._ import org.apache.spark.streaming._ import com.datastax.spark.connector.streaming._ // Spark connection options val conf = new SparkConf().setAppName(appName).setMaster(master) // streaming with 1 second batch window val ssc = new StreamingContext(conf, Seconds(1)) val text = ssc.socketTextStream("localhost", "9191") val words = text.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.saveToCassandra("keyspace", "table", SomeColumns("word", "total")) ssc.start() ssc.awaitTermination()
  • 66. Scaling Streaming • How to scale stream processing? [Diagram: Kafka Producer → Spark Receiver → Spark Processor → Output]
  • 67. Parallelism – Partitioning • Partition the input stream (e.g. by topics) • Each receiver can be run on a separate worker [Diagram: Kafka Topic 1…N → Spark Receiver 1…N → Spark Processor → Output, one pipeline per topic]
  • 68. Parallelism – Partitioning • Partition stream (e.g. by topics) • Use union() to create single DStream • Transformations applied on the unified stream val numStreams = 5 val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) } val unifiedStream = streamingContext.union(kafkaStreams) unifiedStream.print()
  • 69. Parallelism – RePartitioning • Explicitly repartition the input stream • Distribute received batches across a specified number of machines [Diagram: Twitter Producer → Spark Receiver → four parallel Spark Processors → Output]
  • 70. Parallelism – RePartitioning • Explicitly repartition input stream • Distribute received batches across specified number of machines • Use inputstream.repartition(N) val numWorkers = 5 val twitterStream = TwitterUtils.createStream(...) twitterStream.repartition(numWorkers)
  • 71. Parallelism – Tasks • Each block processed by separate task • To increase parallel tasks, increase number of blocks in a batch • Tasks per Receiver per Batch ≈ Batch Interval / Block Interval • Example: 2s batch / 200ms block = 10 tasks • CPU cores will not be utilized if number of tasks is too low • Consider tuning default number of parallel tasks spark.default.parallelism
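  A hedged configuration sketch of how batch interval, block interval, and default parallelism interact (all values are illustrative only):

  val conf = new SparkConf()
    .setAppName("streaming-parallelism")
    // 200ms blocks inside a 2s batch ≈ 10 tasks per receiver per batch
    .set("spark.streaming.blockInterval", "200")
    // baseline number of tasks used by shuffles such as reduceByKey
    .set("spark.default.parallelism", "24")

  val ssc = new StreamingContext(conf, Seconds(2))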
  • 73. Fault Tolerance To recover streaming operation, Spark needs: 1. RDD data 2. DAG/metadata of DStream
  • 74. Fault Tolerance – RDD • A lost RDD may not be recomputable from a stream source (the original input is gone) • Protect data by replicating the RDD • RDD replication is controlled by org.apache.spark.storage.StorageLevel • Use a storage level with the _2 suffix (2 replicas): – DISK_ONLY_2 – MEMORY_ONLY_2 – MEMORY_ONLY_SER_2 – MEMORY_AND_DISK_2 – MEMORY_AND_DISK_SER_2 (default for most receivers)
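  A hedged sketch of asking a receiver for two replicas explicitly (the socket source mirrors the earlier examples; most built-in receivers already default to MEMORY_AND_DISK_SER_2):

  import org.apache.spark.storage.StorageLevel

  // Replicate each received block to a second executor before processing it
  val text = ssc.socketTextStream("localhost", 9191, StorageLevel.MEMORY_AND_DISK_SER_2)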
  • 75. Fault Tolerance – Checkpointing • Periodically writes: 1. DAG/metadata of DStream(s) 2. RDD data for some stateful transformations (updateStateByKey & reduceByKeyAndWindow*) • Uses fault-tolerant distributed file system for persistence. • After failure, StreamingContext recreated from checkpoint data on restart. • Choose interval carefully as storage will impact processing times.
  • 76. Fault Tolerance – Checkpointing import org.apache.spark._ import org.apache.spark.streaming._ val checkpointDirectory = "words.cp" // Directory name for checkpoint data def createContext(): StreamingContext = { // streaming with 1 second batch window val ssc = new StreamingContext(conf, Seconds(1)) val text = ssc.socketTextStream("localhost", "9191") ssc.checkpoint(checkpointDirectory) // set checkpoint directory val words = text.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.saveToCassandra("keyspace", "table", SomeColumns("word", "total")) ssc } val conf = new SparkConf().setAppName(appName).setMaster(master) // Get StreamingContext from checkpoint data or create a new one val scc = StreamingContext.getOrCreate(checkpointDirectory, createContext _) scc.start() scc.awaitTermination()
  • 77. Fault Tolerance – Checkpointing • Verifying checkpoint data on CFS: $ dse hadoop fs -ls words.cp Found 11 items drwxrwxrwx - rustam staff 0 2014-12-21 13:24 /user/rustam/words.cp/b8e8e262-2f8d-4e2f-ae28-f5cfbadb29bf -rwxrwxrwx 1 rustam staff 3363 2014-12-21 13:25 /user/rustam/words.cp/checkpoint-1419168345000 -rwxrwxrwx 1 rustam staff 3368 2014-12-21 13:25 /user/rustam/words.cp/checkpoint-1419168345000.bk -rwxrwxrwx 1 rustam staff 3393 2014-12-21 13:25 /user/rustam/words.cp/checkpoint-1419168350000 -rwxrwxrwx 1 rustam staff 3398 2014-12-21 13:25 /user/rustam/words.cp/checkpoint-1419168350000.bk -rwxrwxrwx 1 rustam staff 3422 2014-12-21 13:25 /user/rustam/words.cp/checkpoint-1419168355000 -rwxrwxrwx 1 rustam staff 3427 2014-12-21 13:25 /user/rustam/words.cp/checkpoint-1419168355000.bk -rwxrwxrwx 1 rustam staff 3447 2014-12-21 13:26 /user/rustam/words.cp/checkpoint-1419168360000 -rwxrwxrwx 1 rustam staff 3452 2014-12-21 13:26 /user/rustam/words.cp/checkpoint-1419168360000.bk -rwxrwxrwx 1 rustam staff 3499 2014-12-21 13:26 /user/rustam/words.cp/checkpoint-1419168365000 -rwxrwxrwx 1 rustam staff 3504 2014-12-21 13:26 /user/rustam/words.cp/checkpoint-
  • 78. Failure Types • Consider 2 failure scenarios: [Diagram: Producer → Receiver → RDD replica 1 and RDD replica 2 → Processor → Output]
  • 79. State of Data 1. Data received and replicated • Will survive failure of 1 replica 2. Data received but only buffered for replication • Not replicated yet • Needs recomputation if lost
  • 80. Receiver Reliability Types 1. Reliable Receivers • Receiver acknowledges source only after ensuring that data replicated. • Source needs to support message ack. E.g. Kafka, Flume. 2. Unreliable Receivers • Data can be lost in case of failure. • Source doesn’t support message ack. E.g. Twitter.
  • 81. Fault Tolerance • Spark 1.2 adds Write Ahead Log (WAL) support for Streaming • Protection for Unreliable Receivers • See SPARK-3129 for architecture details • State vs. Receiver Type: Reliable Receiver – Received & Replicated: Safe; Received, Only Buffered: Safe. Unreliable Receiver – Received & Replicated: Safe; Received, Only Buffered: Data Loss
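  A hedged sketch of turning the WAL on (the property name below matches the Spark Streaming docs around 1.2/1.3 — verify the exact spelling for your version; checkpointing must also be enabled so the log has a durable home):

  val conf = new SparkConf()
    .setAppName("wal-demo")
    // write received data to the write ahead log before acknowledging it
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")

  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("hdfs:///checkpoints/wal-demo")   // WAL segments live under the checkpoint dir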
  • 82. GraphX • Alpha release • Provides Graph computation capabilities on top of RDDs • Resilient Distributed Property Graph: a directed multigraph with properties attached to each vertex and edge. • The goal of the GraphX project is to unify graph-parallel and data-parallel computation in one system with a single composable API.
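  A minimal GraphX sketch (made-up users and follow edges) showing a property graph built from two RDDs and queried with both data-parallel and graph-parallel operators:

  import org.apache.spark.graphx._
  import org.apache.spark.rdd.RDD

  // Vertices carry a name; edges carry a weight
  val users: RDD[(VertexId, String)] =
    sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
  val follows: RDD[Edge[Int]] =
    sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

  val graph = Graph(users, follows)

  println(graph.vertices.count())                       // 3
  println(graph.inDegrees.collect().mkString(", "))     // each vertex is followed once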
  • 83. I am not a Graph-guy yet. Who here is working with Graph today?
  • 84. Handy Tools • Ooyala Spark Job Server - https://github.com/ooyala/spark-jobserver • Monitoring with Graphite and Grafana – http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
