Databricks Spark Chief Architect Reynold Xin's keynote at Spark Summit East 2016, discussing streaming, continuous applications, and DataFrames in Spark.
Databricks Spark Chief Architect Reynold Xin's keynote at Spark Summit East 2016, discussing streaming, continuous applications, and DataFrames in Spark.
5.
StreamingSQL MLlib
Spark Core
GraphXStreaming
Introduced3 years ago in Spark 0.7
50% usersconsider most important part of Spark
Spark Unified Stack
6.
Spark Streaming
• First attempt at unifying streaming and batch
• State management built in
• Exactly once semantics
• Features required for large clusters
• Straggler mitigation,dynamic load balancing,fast fault-recovery
8.
Use Case: Fraud Detection
STREAM
ANOMALY
Machine learningmodel
continuously updates
to detectnew anomalies
Ad-hocanalyze historic data
9.
Continuous Application
noun.
An end-to-end application that acts on real-time data.
10.
Challenges Building Continuous
Applications
Integration with non-streaming systems often an after-thought
• Interactive,batch,relational databases, machine learning,…
Streaming programming models are complex
11.
Integration Example
Streaming
engine
Stream
(home.html, 10:08)
(product.html, 10:09)
(home.html, 10:10)
. . .
What can go wrong?
• Late events
• Partial outputs to MySQL
• State recovery on failure
• Distributed reads/writes
• ...
MySQL
Page Minute Visits
home 10:09 21
pricing 10:10 30
... ... ...
12.
Processing
Businesslogic change & new ops
(windows,sessions)
Complex Programming Models
Output
How do we define
outputover time & correctness?
Data
Late arrival, varying distribution overtime, …
14.
The simplest way to perform streaming analytics
is not having to reason about streaming.
15.
Spark 2.0
Infinite DataFrames
Spark 1.3
Static DataFrames
Single API !
16.
Structured Streaming
High-level streaming API built on SparkSQL engine
• Runsthe same querieson DataFrames
• Eventtime, windowing,sessions,sources& sinks
Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queriesatruntime
• Build and apply ML models
17.
output for
data at 1
Result
Query
Time
data up
to PT 1
Input
complete
output
Output
1 2 3
Trigger: every 1 sec
data up
to PT 2
output for
data at 2
data up
to PT 3
output for
data at 3
Model
18.
delta
output
output for
data at 1
Result
Query
Time
data up
to PT 2
data up
to PT 3
data up
to PT 1
Input
output for
data at 2
output for
data at 3
Output
1 2 3
Trigger: every 1 sec
Model
19.
Model Details
Input sources:append-onlytables
Queries: newoperators for windowing, sessions, etc
Triggers:based on time (e.g. every 1 sec)
Output modes: complete, deltas, update-in-place
20.
Example: ETL
Input: files in S3
Query: map (transform each record)
Trigger: “every5 sec”
Output mode: “newrecords”,into S3 sink
21.
Example: Page View Count
Input: recordsin Kafka
Query: select count(*) group by page, minute(evtime)
Trigger: “every5 sec”
Output mode: “update-in-place”, into MySQL sink
Note: this will automatically update “old” recordson late data!
22.
Logically:
DataFrame operations on static data
(i.e. as easyto understand as batch)
Physically:
Spark automatically runs the queryin
streaming fashion
(i.e. incrementally and continuously)
DataFrame
Logical Plan
Continuous,
incremental execution
Catalyst optimizer
Execution
26.
Rest of Spark will follow
• Interactive queriesshould just work
• Spark’s data sourceAPI will be updated to support seamless
streaming integration
• Exactly once semantics end-to-end
• Different outputmodes (complete,delta, update-in-place)
• ML algorithms will be updated too
27.
What can we do with this that’s hard
with other engines?
Ad-hoc, interactive queries
Dynamic changing queries
Benefits of Spark: elastic scaling, stragglermitigation, etc
28.
Use Case: Fraud Detection
STREAM
ANOMALY
Machine LearningModel
continuously updates
to detectnew anomalies
Analyze Historic Data
29.
Timeline
Spark 2.0
• API foundation
• Kafka, file systems, and
databases
• Event-time aggregations
Spark 2.1 +
• Continuous SQL
• BI app integration
• Other streaming sources/ sinks
• Machine learning
Il semblerait que vous ayez déjà ajouté cette diapositive à .
Créer un clipboard
Vous avez clippé votre première diapositive !
En clippant ainsi les diapos qui vous intéressent, vous pourrez les revoir plus tard. Personnalisez le nom d’un clipboard pour mettre de côté vos diapositives.
Créer un clipboard
Partager ce SlideShare
Offre spéciale pour les lecteurs de SlideShare
Juste pour vous: Essai GRATUIT de 60 jours dans la plus grande bibliothèque numérique du monde.
La famille SlideShare vient de s'agrandir. Profitez de l'accès à des millions de livres numériques, livres audio, magazines et bien plus encore sur Scribd.