The document discusses Spark Streaming and its execution model. It describes how Spark Streaming receives data streams, divides them into micro-batches, and processes the micro-batches using Spark. It also covers transformations and actions that can be performed on DStreams, window operations, and deployment options for Spark Streaming including local, standalone, and cluster modes.
5. @maasg www. .com
How Spark is Driving a New Loosely-coupled Stand-alone Service
Virdata’s Full Stack
Virdata’s Spark as a Service
[Diagram labels: PUB / SUB, MQTT / WebSockets, RAW, Storage, Query, Notebook Server]
Deployment Options
Local: spark.master=local[*]
Standalone: spark.master=spark://host:port
Cluster (using a cluster manager): spark.master=mesos://host:port
[Diagram: nodes labeled M, D, and W for each mode]
Deployment Options
Local: spark.master=local[*]
Standalone: spark.master=spark://host:port
Cluster (using a cluster manager): spark.master=mesos://host:port
[Diagram: nodes labeled M and D, with Rec and Exec boxes, for each mode]
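The three master settings above can also be applied programmatically through `SparkConf`. The following is a minimal sketch, assuming Spark 1.x with the spark-streaming artifact on the classpath. The application name, the 2-second batch interval, and the `SPARK_MASTER` environment variable are placeholders invented for this illustration and do not come from the slides.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DeploymentSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical convention: read the master URL from the environment,
    // falling back to local mode when it is not set.
    //   local[*]           -> local mode, one worker thread per core
    //   spark://host:port  -> standalone cluster
    //   mesos://host:port  -> cluster manager (Mesos)
    val master = sys.env.getOrElse("SPARK_MASTER", "local[*]")

    val conf = new SparkConf()
      .setAppName("deployment-sketch") // placeholder name
      .setMaster(master)

    // The 2-second batch interval is an arbitrary choice for illustration.
    val ssc = new StreamingContext(conf, Seconds(2))
    println(s"Streaming context created with master = ${ssc.sparkContext.master}")

    // No streams are defined in this sketch, so the context is only stopped.
    ssc.stop()
  }
}
```

When receiver-based sources run in local mode, each receiver occupies one thread, so the master needs more threads than receivers (for example `local[2]` for a single receiver) or no batches get processed.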
Kafka: The Receiver-less Model
Simplified Parallelism
Efficiency
Exactly-once semantics
Fewer degrees of freedom
val directKafkaStream =
  KafkaUtils.createDirectStream[
    [key class],
    [value class],
    [key decoder class],
    [value decoder class]](
    streamingContext, [map of Kafka parameters], [set of topics to consume])
spark.streaming.kafka.maxRatePerPartition
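The template above could be filled in as follows. This is a sketch only, assuming Spark 1.3 or later with the spark-streaming-kafka artifact (Kafka 0.8 API) on the classpath and String keys and values. The broker address `localhost:9092`, the topic name `events`, the 5-second batch interval, and the rate limit of 1000 are illustrative assumptions, not values from the slides.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafkaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("direct-kafka-sketch") // placeholder name
      .setMaster("local[*]")
      // Upper bound on messages read per second from each Kafka partition
      // (1000 is an arbitrary value for illustration).
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")

    val ssc = new StreamingContext(conf, Seconds(5))

    // Assumed broker list and topic; replace with real values.
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topics = Set("events")

    // No receiver is created: each batch reads a range of offsets
    // directly from every Kafka partition of the listed topics.
    val directKafkaStream =
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topics)

    // Count the message values in each micro-batch and print the result.
    directKafkaStream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With the direct approach, each Kafka partition maps to one RDD partition, which is what the "Simplified Parallelism" bullet refers to.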
Kafka: The Receiver-less Model
src: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
Delivery Semantics
• Spark Streaming Receiver-based (< v1.2): roughly at least once
• Spark Streaming Receiver with WAL: at least once + zero data loss
• Spark Streaming Direct: at least once + zero data loss
• Spark Streaming Direct + offset management + idempotent writes or transactions: exactly once
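The last bullet can be illustrated without Spark at all. The sketch below is plain Scala and is not code from the talk: the `Record` shape and the in-memory `IdempotentSink` are hypothetical stand-ins for Kafka messages and an external store with upsert semantics. It shows why keying each write by (topic, partition, offset) turns at-least-once delivery into an exactly-once result: a replayed batch overwrites the same rows instead of adding duplicates.

```scala
import scala.collection.mutable

// Hypothetical record shape: a message plus the Kafka coordinates it came from.
final case class Record(topic: String, partition: Int, offset: Long, value: String)

// Stand-in for an external store with upsert semantics (for example a keyed table).
final class IdempotentSink {
  private val rows = mutable.Map.empty[(String, Int, Long), String]

  // Writing the same (topic, partition, offset) twice overwrites the row.
  def write(r: Record): Unit =
    rows.update((r.topic, r.partition, r.offset), r.value)

  def size: Int = rows.size
}

object ExactlyOnceSketch {
  // Writes the batch, then writes it again to simulate redelivery after a failure.
  // Returns the number of rows that end up in the sink.
  def replayTwice(batch: Seq[Record]): Int = {
    val sink = new IdempotentSink
    batch.foreach(sink.write)
    batch.foreach(sink.write) // simulated replay of the same offsets
    sink.size
  }

  def main(args: Array[String]): Unit = {
    val batch = Seq(
      Record("events", 0, 0L, "a"),
      Record("events", 0, 1L, "b"),
      Record("events", 1, 0L, "c"))
    println(replayTwice(batch)) // prints 3: the replay added no duplicates
  }
}
```

In a real job, the offsets for each batch would come from the direct stream itself, and the sink would be a database or key-value store rather than an in-memory map.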
Spark Streaming (v1.5) made Reactive
proportional-integral-derivative controller (PID controller)
Backpressure support
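In Spark 1.5, backpressure is switched on with `spark.streaming.backpressure.enabled=true`. To give a feel for the PID idea, here is a textbook discrete PID controller in plain Scala. It is a simplified sketch and not Spark's internal rate estimator: the class and method names, the gains, and the minimum rate of 100 are all assumptions made for this illustration.

```scala
// Textbook discrete PID controller; NOT Spark's internal implementation.
final class PidController(kp: Double, ki: Double, kd: Double) {
  private var integral = 0.0
  private var previousError = 0.0

  /** Returns the control output for `error` observed over a time step of `dt` seconds. */
  def step(error: Double, dt: Double): Double = {
    require(dt > 0, "dt must be positive")
    integral += error * dt                        // accumulated (integral) term
    val derivative = (error - previousError) / dt // rate of change of the error
    previousError = error
    kp * error + ki * integral + kd * derivative
  }
}

object BackpressureSketch {
  /**
   * Hypothetical helper: lowers the ingestion rate when data arrives faster
   * than it is processed, and never goes below `minRate`.
   */
  def nextRate(currentRate: Double,
               processingRate: Double,
               pid: PidController,
               dt: Double,
               minRate: Double = 100.0): Double = {
    // Positive error means the system ingests faster than it can process.
    val error = currentRate - processingRate
    math.max(currentRate - pid.step(error, dt), minRate)
  }

  def main(args: Array[String]): Unit = {
    // Proportional-only gains, chosen to keep the arithmetic easy to follow.
    val pid = new PidController(kp = 1.0, ki = 0.0, kd = 0.0)
    println(nextRate(currentRate = 1000.0, processingRate = 800.0, pid = pid, dt = 1.0)) // prints 800.0
  }
}
```

With proportional-only gains, the new rate simply snaps to the observed processing rate. Non-zero integral and derivative gains would additionally react to accumulated lag and to how quickly the error is changing.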