Leveraging Azure Databricks to minimize time to insight by combining Batch and Stream processing pipelines.
A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure
Best of Databricks Best of Microsoft
Designed in collaboration with the founders of Apache Spark
One-click setup; streamlined workflows
Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage, ADF, SQL DB, AAD)
Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs – 99.95%)
A unified, open source, parallel data processing framework for Big Data Analytics
The Spark stack: the Spark Core Engine, with libraries on top – Spark SQL (interactive queries), Spark Structured Streaming and Streaming (stream processing), MLlib (machine learning), GraphX (graph computation) – running on YARN, Mesos or the Standalone Scheduler.
Spark architecture: a Driver Program (SparkContext) coordinates work through a Cluster Manager across Worker Nodes, reading from data sources such as Azure Storage, Cosmos DB and SQL.
Complex data: diverse data formats (json, avro, binary, …); data can be dirty, late, out-of-order.
Complex systems: diverse storage systems (Kafka, Azure Storage, Event Hubs, SQL DW, …); system failures.
Complex workloads: combining streaming with interactive queries; machine learning.
You should not have to reason about streaming. You should write simple queries, and Azure Databricks should continuously update the answer.
[Diagram: a traditional batch pipeline dumps files into a table over hours; a streaming pipeline keeps the table updated within seconds.]
• High-Level APIs — DataFrames, Datasets and SQL. Same in streaming and in batch (see the sketch below).
• Event-time Processing — Native support for working with out-of-order and late data.
• End-to-end Exactly Once — Transactional both in processing and output.
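As a hedged illustration of the first bullet, the same transformation can run against a bounded batch read and an unbounded streaming read; only the read call changes. The paths, the eventSchema value and the column names below are placeholders, not from the deck.

import org.apache.spark.sql.DataFrame

// Identical business logic for batch and streaming (columns assumed).
def viewsPerCampaign(events: DataFrame): DataFrame =
  events.where("event_type = 'view'").groupBy("campaign_id").count()

val batchViews  = viewsPerCampaign(spark.read.json("/data/events"))                            // bounded
val streamViews = viewsPerCampaign(spark.readStream.schema(eventSchema).json("/data/events"))  // unbounded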
Pipeline: Read Stream → Operations → Output → Analysis
eventsDF = (spark.readStream
  .format("kafka")
  .option("subscribe", "myTopic")
  .option("startingOffsets", "earliest")
  .load()
)
Source
• Specify readStream and the format
• Specify one or more locations to read data from
• Optionally provide starting offsets
• Combine multiple sources using union()
DataFrames

eventsDF
  .where("event_type = 'view'")
  .join(table("campaigns"), "ad_id")
  .groupBy('campaign_id)
  .count()

Spark SQL

SELECT campaign_id, count(*)
FROM events
JOIN campaigns ON (campaigns.ad_id = events.ad_id)
WHERE event_type = 'view'
GROUP BY campaign_id
E.g. trigger an alert when clicks > 9999 and generate an email alert. Latency: ~100 ms. A sketch of what such a writer could look like follows.

eventsDF
  .filter("clicks > 9999")
  .writeStream
  .foreach(new EmailAlert())
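A minimal sketch of what the EmailAlert writer used above could look like, assuming a Spark 2.x ForeachWriter[Row]; the row fields and the sendEmail helper are illustrative, not from the deck.

import org.apache.spark.sql.{ForeachWriter, Row}

class EmailAlert extends ForeachWriter[Row] {
  def open(partitionId: Long, version: Long): Boolean = true        // nothing to set up
  def process(row: Row): Unit = {
    val adId   = row.getAs[String]("ad_id")                         // assumed column
    val clicks = row.getAs[Long]("clicks")                          // assumed column
    sendEmail(s"Ad $adId exceeded $clicks clicks")
  }
  def close(errorOrNull: Throwable): Unit = ()                      // nothing to clean up
  private def sendEmail(body: String): Unit = println(s"ALERT: $body")  // placeholder for a real mailer
}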
Forward filtered and augmented events back to Kafka (latency: ~100 ms average). Filter with .where("eventType = 'view'"), use to_json() to convert the columns back into a JSON string, and save the result (filteredLogs) to a different Kafka topic ("topic", "views").

ETL: store the augmented stream as efficient columnar data for later processing (latency: ~1 minute). Use .repartition(1) and .trigger("1 minute") to buffer data and write one large file every minute for efficient reads. Both patterns are assembled into complete queries in the sketch below.
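A hedged sketch assembling the fragments above into two complete queries; the broker address, checkpoint paths and the filteredEvents column layout are assumptions. In open-source Spark the trigger is spelled Trigger.ProcessingTime("1 minute").

import org.apache.spark.sql.streaming.Trigger

// 1. Forward filtered events back to Kafka as JSON (latency ~100 ms).
filteredEvents
  .where("eventType = 'view'")
  .selectExpr("to_json(struct(*)) AS value")            // convert columns back into a JSON string
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")    // placeholder brokers
  .option("topic", "views")
  .option("checkpointLocation", "/checkpoints/views")
  .start()

// 2. Store the augmented stream as Parquet, one larger file per trigger (latency ~1 minute).
filteredEvents
  .repartition(1)                                       // buffer into one file per trigger
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "/checkpoints/etl")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start("/data/events")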
Ad-hoc Analysis: complex, ad-hoc queries on the latest data, within seconds. Query the streaming output directly, e.g. parquet.`/data/events`, to troubleshoot problems as they occur with the latest information (latency: ~1 minute). The query reads the latest data each time it is executed.
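A sketch of the ad-hoc pattern above: querying the Parquet files that the streaming job keeps writing, so every execution sees the latest data. The column names and the one-hour filter are illustrative.

val recentErrors = spark.sql("""
  SELECT device, count(*) AS errors
  FROM parquet.`/data/events`
  WHERE event_type = 'error'
    AND timestamp > current_timestamp() - INTERVAL 1 HOUR
  GROUP BY device
""")
recentErrors.show()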
 Delta is a new table format that solves problems with Parquet.
 Delta brings data reliability and performance optimizations to the cloud data lake.
● Improves performance by organizing data and creating metadata
- Manages optimal file sizes
- Stores statistics that enable data skipping
- Creates and maintains indexes
● Improves data reliability by adding data management capabilities
- Transactional guarantees simplify data pipelines
- Schema enforcement ensures data is not corrupted
- Upserts make fixing bad data simple (see the sketch below)
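A hedged sketch of how the Delta capabilities above are typically used on Azure Databricks; the paths, the events stream, the corrections view and the eventId key are assumptions, not from the deck.

// Stream into a Delta table instead of plain Parquet (events is a placeholder streaming DataFrame).
events.writeStream
  .format("delta")
  .option("checkpointLocation", "/checkpoints/events-delta")
  .start("/delta/events")

// Upserts ("fixing bad data") via MERGE, using an assumed corrections view of repaired records.
spark.sql("""
  MERGE INTO delta.`/delta/events` AS t
  USING corrections AS c
  ON t.eventId = c.eventId
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")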
https://docs.azuredatabricks.net/
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-what-is-event-hubs
https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-introduction
https://docs.microsoft.com/en-us/azure/cosmos-db/introduction
 Unifies streaming, interactive and batch queries: a single API for both static bounded data and streaming unbounded data.
 Runs on Spark SQL. Uses the Spark SQL Dataset/DataFrame API used for batch processing of static data.
 Runs incrementally and continuously and updates the results as data streams in.
 Supports app development in Scala, Java, Python and R.
 Supports streaming aggregations, event-time windows, windowed grouped aggregation, stream-to-batch joins.
 Features streaming deduplication, multiple output modes and APIs for managing/monitoring streaming queries.
 Built-in sources: Kafka, File source (json, csv, text, parquet)
A unified system for end-to-end fault-tolerant, exactly-once stateful stream processing.
spark.readStream
  .format("kafka")
  .option("subscribe", "input")
  .load()
  .groupBy('value.cast("string") as 'key)
  .agg(count("*") as 'value)
  .writeStream
  .format("kafka")
  .option("topic", "output")
  .trigger("1 minute")
  .outputMode(OutputMode.Complete())
  .option("checkpointLocation", "…")
  .start()
Source
• Specify one or more locations to read data from
• Built-in support for Files/Kafka/Socket, pluggable
• Can include multiple sources of different types using union() (see the sketch below)
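A sketch of combining two sources with union(), as mentioned in the last bullet; topics, paths and the broker address are placeholders, and both branches are projected to the same schema so the union is valid.

val kafkaEvents = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder brokers
  .option("subscribe", "input")
  .load()
  .selectExpr("CAST(value AS STRING) AS raw")

val fileEvents = spark.readStream
  .format("text")                                      // text source has a fixed value: string schema
  .load("/landing/events")
  .selectExpr("value AS raw")

val allEvents = kafkaEvents.union(fileEvents)          // schemas must match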
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
.writeStream
.format("kafka")
.option("topic", "output")
.trigger("1 minute")
.outputMode(OutputMode.Complete())
.option("checkpointLocation", "…")
.start()
Transformation
• Using DataFrames, Datasets and/or SQL.
• Catalyst figures out how to execute the transformation incrementally.
• Internal processing always exactly-once.
DataFrames, Datasets, SQL

input = spark.readStream
  .format("kafka")
  .option("subscribe", "topic")
  .load()

result = input
  .select("device", "signal")
  .where("signal > 15")

result.writeStream
  .format("parquet")
  .start("dest-path")
Logical plan: read from Kafka → project device, signal → filter signal > 15 → write to the sink. Catalyst turns this into an optimized physical plan (Kafka source, optimized operators with codegen, off-heap memory, etc., and the sink), then runs it as a series of incremental execution plans, processing new data at each trigger (t = 1, t = 2, t = 3, …).
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
.writeStream
.format("kafka")
.option("topic", "output")
.trigger("1 minute")
.outputMode(OutputMode.Complete())
.option("checkpointLocation", "…")
.start()
Sink
• Accepts the output of each batch.
• Where supported, sinks are transactional and exactly-once (e.g. Files).
• Use foreach to execute arbitrary code.
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
.writeStream
.format("kafka")
.option("topic", "output")
.trigger("1 minute")
.outputMode("update")
.option("checkpointLocation", "…")
.start()
Output mode – what's output
• Complete – output the whole answer every time
• Update – output changed rows
• Append – output new rows only
Trigger – when to output
• Specified as a time; eventually supports data size
• No trigger means as fast as possible
A sketch contrasting the output modes follows.
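A hedged sketch of the output modes above: an aggregation can use complete (or update), while a simple selection/filter uses append. The events stream, sink choices and paths are illustrative.

// Aggregation: re-emit the whole result table at every trigger.
events.groupBy("device").count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()

// No aggregation: only new rows are emitted, suitable for file sinks.
events.where("signal > 15")
  .writeStream
  .outputMode("append")
  .format("parquet")
  .option("checkpointLocation", "/checkpoints/filtered")
  .start("/data/filtered")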
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
.writeStream
.format("kafka")
.option("topic", "output")
.trigger("1 minute")
.outputMode("update")
.option("checkpointLocation", "…")
.start()
Checkpoint
• Tracks the progress of a query in persistent storage
• Can be used to restart the query if there is a failure
Checkpointing tracks the progress (offsets) of consuming data from the source, as well as intermediate state. Offsets and metadata are saved as JSON. You can resume after changing your streaming transformations.
This gives end-to-end exactly-once guarantees: at each trigger (t = 1, 2, 3, …) new data is processed and progress is recorded in a write-ahead log.
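A small sketch of the recovery behaviour described above, assuming the query is simply restarted with the same checkpoint location; Spark then resumes from the saved offsets and state. eventStream and the paths are placeholders.

// Re-running the same query with the same checkpointLocation after a failure
// picks up from the saved offsets and state.
val restarted = eventStream.writeStream
  .format("parquet")
  .option("checkpointLocation", "/checkpoints/etl")   // identical to the failed run
  .start("/parquetTable")
// restarted.awaitTermination()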
Specify options to configure
How?
kafka.bootstrap.servers => broker1,broker2
What?
subscribe => topic1,topic2,topic3 // fixed list of topics
subscribePattern => topic* // dynamic list of topics
assign => {"topicA":[0,1]} // specific partitions
Where?
startingOffsets => latest (default) / earliest / {"topicA":{"0":23,"1":345}}
val rawData = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()

The rawData DataFrame has the following columns:

key       value     topic     partition  offset  timestamp
[binary]  [binary]  "topicA"  0          345     1486087873
[binary]  [binary]  "topicB"  3          2890    1486086721
Cast the binary value to string and name it column json. Parse the json string and expand it into nested columns, named data.

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json("json", schema).as("data"))
  .select("data.*")
json
{ "timestamp": 1486087873, "device": "devA", …}
{ "timestamp": 1486082418, "device": "devX", …}

from_json("json") as "data" parses each json string into a nested data column:

data (nested)
timestamp   device …
1486087873  devA   …
1486086721  devX   …
Flatten the nested columns with select("data.*"), turning the nested data struct into top-level (not nested) columns:

timestamp   device …
1486087873  devA   …
1486086721  devX   …
Powerful built-in APIs to perform complex data transformations: from_json, to_json, explode, … 100s of functions (see our blog post).
Writing to Parquet

Save parsed data as a Parquet table in the given path. Partition files by date so that future queries on time slices of the data are fast, e.g. a query on the last 48 hours of data (see the sketch below).

val query = parsedData.writeStream
  .option("checkpointLocation", ...)
  .partitionBy("date")
  .format("parquet")
  .start("/parquetTable")
Checkpointing

Enable checkpointing by setting the checkpoint location to save offset logs. start actually starts a continuously running StreamingQuery in the Spark cluster.

val query = parsedData.writeStream
  .option("checkpointLocation", ...)
  .format("parquet")
  .partitionBy("date")
  .start("/parquetTable/")
Streaming Query

query is a handle to the continuously running StreamingQuery, used to monitor and manage the execution.

val query = parsedData.writeStream
  .option("checkpointLocation", ...)
  .format("parquet")
  .partitionBy("date")
  .start("/parquetTable")

The StreamingQuery processes new data at each trigger (t = 1, 2, 3, …).
Data Consistency on Ad-hoc Queries

Data is available for complex, ad-hoc analytics within seconds. The Parquet table is updated atomically, which ensures prefix integrity. Even if distributed, ad-hoc queries will see either all updates from the streaming query or none. Read more in our blog:
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
More Kafka Support [Spark 2.2]

Write out to Kafka: the DataFrame must have binary fields named key and value.

result.writeStream
  .format("kafka")
  .option("topic", "output")
  .start()

Direct, interactive and batch queries on Kafka make Kafka even more powerful as a storage platform!

val df = spark
  .read // not readStream
  .format("kafka")
  .option("subscribe", "topic")
  .load()
df.registerTempTable("topicData")
spark.sql("select value from topicData")
Working With Time

Event Time: many use cases require aggregate statistics by event time, e.g. what is the number of errors in each system in 1-hour windows? There are many challenges: extracting event time from the data, and handling late, out-of-order data. The DStream APIs were insufficient for event-time processing.
Event-time Aggregations

Windowing is just another type of grouping in Structured Streaming. Supports UDAFs!

Number of records every hour:

parsedData
  .groupBy(window("timestamp", "1 hour"))
  .count()

Average signal strength of each device every 10 minutes:

parsedData
  .groupBy("device", window("timestamp", "10 mins"))
  .avg("signal")
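Not shown in the deck, but the same window() function also accepts a slide interval, giving overlapping windows; a small sketch:

import org.apache.spark.sql.functions.{col, window}

// 10-minute windows computed every 5 minutes, per device.
parsedData
  .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("device"))
  .avg("signal")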
Stateful Processing for Aggregations

Aggregates have to be saved as distributed state between triggers. Each trigger reads the previous state and writes updated state. State is stored in memory, backed by a write-ahead log in HDFS/S3. Fault-tolerant, exactly-once guarantee!
[Diagram: at each trigger (t = 1, 2, 3, …) the query reads from the source, processes the new data together with the previous state, writes to the sink, and persists state updates to the write-ahead log for checkpointing.]
Automatically handles Late Data

[Diagram: hourly windows (12:00-13:00, 13:00-14:00, …) with running counts that are updated across triggers as new data arrives; windows updated with late data are highlighted in red.]

Keeping state allows late data to update counts of old windows. But the size of the state increases indefinitely if old windows are not dropped.
Watermarking

Watermark: a moving threshold of how late data is expected to be and when to drop old state. It trails behind the max event time seen, and the trailing gap is configurable. [Diagram: with a max event time of 12:30 PM and a trailing gap of 10 mins, the watermark is 12:20 PM; data older than the watermark is not expected.]
Data newer than the watermark may be late, but is allowed to aggregate. Data older than the watermark is "too late" and dropped. Windows older than the watermark are automatically deleted to limit the amount of intermediate state.
With an allowed lateness of 10 minutes:

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window("timestamp", "5 minutes"))
  .count()

Late data newer than the watermark is allowed to aggregate; data older than the watermark is too late and dropped. Watermarking is useful only in stateful operations (streaming aggregations, dropDuplicates, mapGroupsWithState, …). It is ignored in non-stateful streaming queries and batch queries.
Watermarking example: [Diagram over processing time 12:10 / 12:15 / 12:20 with the query above.] The system tracks the max observed event time; with a max event time of 12:14, the watermark is updated to 12:14 - 10 min = 12:04 for the next trigger, and state for windows older than 12:04 is deleted. Data that arrives late but is newer than the watermark (e.g. 12:08) is still considered in the counts; data older than the watermark is too late, ignored in the counts, and its state is dropped. More details in my blog post.
Clean separation of concerns

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window("timestamp", "5 minutes"))
  .count()
  .writeStream
  .trigger("10 seconds")
  .start()

Query semantics are separated from processing details:
• How to group data by time? (same for batch & streaming)
• How late can data be?
• How often to emit updates?
Arbitrary Stateful Operations [Spark 2.2]

mapGroupsWithState applies a user-defined stateful function to user-defined state, with direct support for per-key timeouts in event time or processing time. Supports Scala and Java. A concrete sketch follows.

ds.groupByKey(_.id)
  .mapGroupsWithState(timeoutConf)(mappingWithStateFunc)

def mappingWithStateFunc(
    key: K,
    values: Iterator[V],
    state: GroupState[S]): U = {
  // update or remove state
  // set timeouts
  // return mapped value
}
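A hedged, concrete instance of the skeleton above: a running event count per id that expires after 30 minutes of inactivity. The Event case class, the ds dataset and the timeout policy are illustrative, and spark.implicits._ is assumed to be in scope for the encoders.

import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(id: String, timestamp: Timestamp)

def countEvents(id: String,
                events: Iterator[Event],
                state: GroupState[Long]): (String, Long) = {
  if (state.hasTimedOut) {                       // no events for this id within the timeout
    val total = state.get
    state.remove()                               // drop the expired state
    (id, total)
  } else {
    val updated = state.getOption.getOrElse(0L) + events.size
    state.update(updated)
    state.setTimeoutDuration("30 minutes")       // per-key processing-time timeout
    (id, updated)
  }
}

val counts = ds.groupByKey(_.id)                 // ds: Dataset[Event]
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(countEvents)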
Other interesting operations

Streaming Deduplication (with watermarks to limit state):
parsedData.dropDuplicates("eventId")

Stream-batch Joins:
val batchData = spark.read
  .format("parquet")
  .load("/additional-data")
parsedData.join(batchData, "device")

Stream-stream Joins: can use mapGroupsWithState; direct support coming soon!

A watermarked deduplication sketch follows.
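A small sketch of the watermarked deduplication mentioned above, so that state for old event ids can eventually be dropped; the column names follow the deck's examples.

parsedData
  .withWatermark("timestamp", "10 minutes")
  .dropDuplicates("eventId", "timestamp")   // dedupe within the watermark, then expire state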

Contenu connexe

Tendances

Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
 
Big Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft AzureBig Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft AzureMark Tabladillo
 
Building Advanced Analytics Pipelines with Azure Databricks
Building Advanced Analytics Pipelines with Azure DatabricksBuilding Advanced Analytics Pipelines with Azure Databricks
Building Advanced Analytics Pipelines with Azure DatabricksLace Lofranco
 
Azure Data Lake and U-SQL
Azure Data Lake and U-SQLAzure Data Lake and U-SQL
Azure Data Lake and U-SQLMichael Rys
 
201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine LearningMark Tabladillo
 
Azure Data Lake Analytics Deep Dive
Azure Data Lake Analytics Deep DiveAzure Data Lake Analytics Deep Dive
Azure Data Lake Analytics Deep DiveIlyas F ☁☁☁
 
Azure data bricks by Eugene Polonichko
Azure data bricks by Eugene PolonichkoAzure data bricks by Eugene Polonichko
Azure data bricks by Eugene PolonichkoAlex Tumanoff
 
Spark as a Service with Azure Databricks
Spark as a Service with Azure DatabricksSpark as a Service with Azure Databricks
Spark as a Service with Azure DatabricksLace Lofranco
 
Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Michael Rys
 
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannAzure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannDatabricks
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overviewJames Serra
 
A developer's introduction to big data processing with Azure Databricks
A developer's introduction to big data processing with Azure DatabricksA developer's introduction to big data processing with Azure Databricks
A developer's introduction to big data processing with Azure DatabricksMicrosoft Tech Community
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkIke Ellis
 
Accessing Google Cloud APIs
Accessing Google Cloud APIsAccessing Google Cloud APIs
Accessing Google Cloud APIswesley chun
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseData Con LA
 
Is there a way that we can build our Azure Data Factory all with parameters b...
Is there a way that we can build our Azure Data Factory all with parameters b...Is there a way that we can build our Azure Data Factory all with parameters b...
Is there a way that we can build our Azure Data Factory all with parameters b...Erwin de Kreuk
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on AzureTrivadis
 
Introducing Azure SQL Database
Introducing Azure SQL DatabaseIntroducing Azure SQL Database
Introducing Azure SQL DatabaseJames Serra
 
Deep Dive into Azure Data Factory v2
Deep Dive into Azure Data Factory v2Deep Dive into Azure Data Factory v2
Deep Dive into Azure Data Factory v2Eric Bragas
 
Azure Cosmos DB + Gremlin API in Action
Azure Cosmos DB + Gremlin API in ActionAzure Cosmos DB + Gremlin API in Action
Azure Cosmos DB + Gremlin API in ActionDenys Chamberland
 

Tendances (20)

Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
Big Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft AzureBig Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft Azure
 
Building Advanced Analytics Pipelines with Azure Databricks
Building Advanced Analytics Pipelines with Azure DatabricksBuilding Advanced Analytics Pipelines with Azure Databricks
Building Advanced Analytics Pipelines with Azure Databricks
 
Azure Data Lake and U-SQL
Azure Data Lake and U-SQLAzure Data Lake and U-SQL
Azure Data Lake and U-SQL
 
201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning
 
Azure Data Lake Analytics Deep Dive
Azure Data Lake Analytics Deep DiveAzure Data Lake Analytics Deep Dive
Azure Data Lake Analytics Deep Dive
 
Azure data bricks by Eugene Polonichko
Azure data bricks by Eugene PolonichkoAzure data bricks by Eugene Polonichko
Azure data bricks by Eugene Polonichko
 
Spark as a Service with Azure Databricks
Spark as a Service with Azure DatabricksSpark as a Service with Azure Databricks
Spark as a Service with Azure Databricks
 
Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)
 
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannAzure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
A developer's introduction to big data processing with Azure Databricks
A developer's introduction to big data processing with Azure DatabricksA developer's introduction to big data processing with Azure Databricks
A developer's introduction to big data processing with Azure Databricks
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
 
Accessing Google Cloud APIs
Accessing Google Cloud APIsAccessing Google Cloud APIs
Accessing Google Cloud APIs
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Is there a way that we can build our Azure Data Factory all with parameters b...
Is there a way that we can build our Azure Data Factory all with parameters b...Is there a way that we can build our Azure Data Factory all with parameters b...
Is there a way that we can build our Azure Data Factory all with parameters b...
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
 
Introducing Azure SQL Database
Introducing Azure SQL DatabaseIntroducing Azure SQL Database
Introducing Azure SQL Database
 
Deep Dive into Azure Data Factory v2
Deep Dive into Azure Data Factory v2Deep Dive into Azure Data Factory v2
Deep Dive into Azure Data Factory v2
 
Azure Cosmos DB + Gremlin API in Action
Azure Cosmos DB + Gremlin API in ActionAzure Cosmos DB + Gremlin API in Action
Azure Cosmos DB + Gremlin API in Action
 

Similaire à Leveraging Azure Databricks to minimize time to insight by combining Batch and Stream processing pipelines.

Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkDatabricks
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Databricks
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Databricks
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIDatabricks
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionDatabricks
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsGuido Schmutz
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Apache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best PracticesApache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best Practicesfelixcss
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?Miklos Christine
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...DataWorks Summit
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Databricks
 
Service messaging using Kafka
Service messaging using KafkaService messaging using Kafka
Service messaging using KafkaRobert Vadai
 
DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...
DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...
DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...DevOps_Fest
 

Similaire à Leveraging Azure Databricks to minimize time to insight by combining Batch and Stream processing pipelines. (20)

Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Apache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best PracticesApache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best Practices
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafka
 
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
 
Service messaging using Kafka
Service messaging using KafkaService messaging using Kafka
Service messaging using Kafka
 
DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...
DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...
DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...
 

Plus de Microsoft Tech Community

Removing Security Roadblocks to IoT Deployment Success
Removing Security Roadblocks to IoT Deployment SuccessRemoving Security Roadblocks to IoT Deployment Success
Removing Security Roadblocks to IoT Deployment SuccessMicrosoft Tech Community
 
Building mobile apps with Visual Studio and Xamarin
Building mobile apps with Visual Studio and XamarinBuilding mobile apps with Visual Studio and Xamarin
Building mobile apps with Visual Studio and XamarinMicrosoft Tech Community
 
Best practices with Microsoft Graph: Making your applications more performant...
Best practices with Microsoft Graph: Making your applications more performant...Best practices with Microsoft Graph: Making your applications more performant...
Best practices with Microsoft Graph: Making your applications more performant...Microsoft Tech Community
 
Interactive emails in Outlook with Adaptive Cards
Interactive emails in Outlook with Adaptive CardsInteractive emails in Outlook with Adaptive Cards
Interactive emails in Outlook with Adaptive CardsMicrosoft Tech Community
 
Unlocking security insights with Microsoft Graph API
Unlocking security insights with Microsoft Graph APIUnlocking security insights with Microsoft Graph API
Unlocking security insights with Microsoft Graph APIMicrosoft Tech Community
 
Break through the serverless barriers with Durable Functions
Break through the serverless barriers with Durable FunctionsBreak through the serverless barriers with Durable Functions
Break through the serverless barriers with Durable FunctionsMicrosoft Tech Community
 
Multiplayer Server Scaling with Azure Container Instances
Multiplayer Server Scaling with Azure Container InstancesMultiplayer Server Scaling with Azure Container Instances
Multiplayer Server Scaling with Azure Container InstancesMicrosoft Tech Community
 
Media Streaming Apps with Azure and Xamarin
Media Streaming Apps with Azure and XamarinMedia Streaming Apps with Azure and Xamarin
Media Streaming Apps with Azure and XamarinMicrosoft Tech Community
 
Real-World Solutions with PowerApps: Tips & tricks to manage your app complexity
Real-World Solutions with PowerApps: Tips & tricks to manage your app complexityReal-World Solutions with PowerApps: Tips & tricks to manage your app complexity
Real-World Solutions with PowerApps: Tips & tricks to manage your app complexityMicrosoft Tech Community
 
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsightIngestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsightMicrosoft Tech Community
 
Getting Started with Visual Studio Tools for AI
Getting Started with Visual Studio Tools for AIGetting Started with Visual Studio Tools for AI
Getting Started with Visual Studio Tools for AIMicrosoft Tech Community
 
Mobile Workforce Location Tracking with Bing Maps
Mobile Workforce Location Tracking with Bing MapsMobile Workforce Location Tracking with Bing Maps
Mobile Workforce Location Tracking with Bing MapsMicrosoft Tech Community
 
Cognitive Services Labs in action Anomaly detection
Cognitive Services Labs in action Anomaly detectionCognitive Services Labs in action Anomaly detection
Cognitive Services Labs in action Anomaly detectionMicrosoft Tech Community
 

Plus de Microsoft Tech Community (20)

100 ways to use Yammer
100 ways to use Yammer100 ways to use Yammer
100 ways to use Yammer
 
10 Yammer Group Suggestions
10 Yammer Group Suggestions10 Yammer Group Suggestions
10 Yammer Group Suggestions
 
Removing Security Roadblocks to IoT Deployment Success
Removing Security Roadblocks to IoT Deployment SuccessRemoving Security Roadblocks to IoT Deployment Success
Removing Security Roadblocks to IoT Deployment Success
 
Building mobile apps with Visual Studio and Xamarin
Building mobile apps with Visual Studio and XamarinBuilding mobile apps with Visual Studio and Xamarin
Building mobile apps with Visual Studio and Xamarin
 
Best practices with Microsoft Graph: Making your applications more performant...
Best practices with Microsoft Graph: Making your applications more performant...Best practices with Microsoft Graph: Making your applications more performant...
Best practices with Microsoft Graph: Making your applications more performant...
 
Interactive emails in Outlook with Adaptive Cards
Interactive emails in Outlook with Adaptive CardsInteractive emails in Outlook with Adaptive Cards
Interactive emails in Outlook with Adaptive Cards
 
Unlocking security insights with Microsoft Graph API
Unlocking security insights with Microsoft Graph APIUnlocking security insights with Microsoft Graph API
Unlocking security insights with Microsoft Graph API
 
Break through the serverless barriers with Durable Functions
Break through the serverless barriers with Durable FunctionsBreak through the serverless barriers with Durable Functions
Break through the serverless barriers with Durable Functions
 
Multiplayer Server Scaling with Azure Container Instances
Multiplayer Server Scaling with Azure Container InstancesMultiplayer Server Scaling with Azure Container Instances
Multiplayer Server Scaling with Azure Container Instances
 
Explore Azure Cosmos DB
Explore Azure Cosmos DBExplore Azure Cosmos DB
Explore Azure Cosmos DB
 
Media Streaming Apps with Azure and Xamarin
Media Streaming Apps with Azure and XamarinMedia Streaming Apps with Azure and Xamarin
Media Streaming Apps with Azure and Xamarin
 
DevOps for Data Science
DevOps for Data ScienceDevOps for Data Science
DevOps for Data Science
 
Real-World Solutions with PowerApps: Tips & tricks to manage your app complexity
Real-World Solutions with PowerApps: Tips & tricks to manage your app complexityReal-World Solutions with PowerApps: Tips & tricks to manage your app complexity
Real-World Solutions with PowerApps: Tips & tricks to manage your app complexity
 
Azure Functions and Microsoft Graph
Azure Functions and Microsoft GraphAzure Functions and Microsoft Graph
Azure Functions and Microsoft Graph
 
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsightIngestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
 
Getting Started with Visual Studio Tools for AI
Getting Started with Visual Studio Tools for AIGetting Started with Visual Studio Tools for AI
Getting Started with Visual Studio Tools for AI
 
Using AML Python SDK
Using AML Python SDKUsing AML Python SDK
Using AML Python SDK
 
Mobile Workforce Location Tracking with Bing Maps
Mobile Workforce Location Tracking with Bing MapsMobile Workforce Location Tracking with Bing Maps
Mobile Workforce Location Tracking with Bing Maps
 
Cognitive Services Labs in action Anomaly detection
Cognitive Services Labs in action Anomaly detectionCognitive Services Labs in action Anomaly detection
Cognitive Services Labs in action Anomaly detection
 
Speech Devices SDK
Speech Devices SDKSpeech Devices SDK
Speech Devices SDK
 

Dernier

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 

Dernier (20)

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 

Leveraging Azure Databricks to minimize time to insight by combining Batch and Stream processing pipelines.

  • 1.
  • 2.
  • 3.
  • 4.
  • 5.
  • 6. A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure Best of Databricks Best of Microsoft Designed in collaboration with the founders of Apache Spark One-click set up; streamlined workflows Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage, ADF, SQL DB, AAD) Enterprise-grade Azure security (Active Directory integration, compliance, enterprise -grade SLAs – 99.95%)
  • 7. An unified, open source, parallel, data processing framework for Big Data Analytics Spark Core Engine Spark SQL Interactive Queries Spark Structured Streaming Stream processing Spark MLlib Machine Learning Yarn Mesos Standalone Scheduler MLlib Machine Learning Streaming Stream processing GraphX Graph Computation
  • 8. Data Sources (Azure Storage, Cosmos DB, SQL) Cluster Manager Worker Node Worker Node Worker Node Driver Program SparkContext
  • 9.
  • 10.
  • 11.
  • 12. COMPLEX DATA Diverse data formats (json, avro, binary, …) Data can be dirty, late, out-of-order COMPLEX SYSTEMS Diverse storage systems (Kafka, Azure Storage,Event Hubs, SQL DW, …) System failures COMPLEX WORKLOADS Combining streaming with interactive queries Machine learning
  • 13. you should not have to reason about streaming
  • 14. you should write simple queries & Azure Databricks should continuously update the answer
  • 18. • High-Level APIs — DataFrames, Datasets and SQL. Same in streaming and in batch. • Event-time Processing — Native support for working with out-of-order and late data. • End-to-end Exactly Once — Transactional both in processing and output.
  • 20.
  • 21. eventsDF = spark.readStream .format("kafka") .option(“subscribe", “myTopic") .option("startingOffsets", "earliest") .load() ) Source • Specify readStream • Specify format • Specify one or more locations to read data from • Provide an offset • using union()
  • 23.
  • 24. DataFrames eventsDF .where("event_type = 'view'") .join(table("campaigns"), "ad_id") .groupBy( 'campaign_id) .count() Spark SQL SELECT campaign_id, count(*) FROM eventsDataFrame JOIN campaigns ON (campaigns.ad_id = events.ad_id) WHERE event_type = "view" GROUP BY "campaign_id"
  • 25.
  • 26. eventsDF .filter( clicks > 9999) .writeStream .foreach(new EmailAlert()) E.g. Trigger an Alert when clicks > 9999 Latency: ~100 ms Generate Email Alert
  • 28.
  • 29. filteredEvents .where("eventType = ‘view’") to_json filteredLogs "topic", “views" Forward filtered and augmented events back to Kafka Latency: ~100ms average Filter to_json() to convert columns back into json string, and then save as different Kafka topic
  • 30. ETL Store augmented stream as efficient columnar data for later processing Latency: ~1 minute .repartition(1) .trigger("1 minute") Buffer data and write one large file every minute for efficient reads
  • 33. parquet.`/data/events` Trouble shoot problems as they occur with latest information Latency: ~1 minute Ad-hoc Analysis will read latest data when query executed
  • 34.  Delta is a new Table Format that solves problems with Parquet  Delta brings data reliability and performance optimizations to the cloud data lake. ● Improves performance by organizing data and creating metadata - Manages optimal files sizes - Stores statistics that enable data skipping - Creates and maintains indexes ● Improves data reliability by adding data management capabilities - Transactional guarantees simplify data pipelines - Schema enforcement ensures data is not corrupted - Upserts makes fixing bad data simple
  • 35.
  • 36.
  • 38.
  • 39.
  • 40.
  • 41.  Unifies streaming, interactive and batch queries—a single API for both static bounded data and streaming unbounded data.  Runs on Spark SQL. Uses the Spark SQL Dataset/DataFrame API used for batch processing of static data.  Runs incrementally and continuously and updates the results as data streams in.  Supports app development in Scala, Java, Python and R.  Supports streaming aggregations, event-time windows, windowed grouped aggregation, stream-to-batch joins.  Features streaming deduplication, multiple output modes and APIs for managing/monitoring streaming queries.  Built-in sources: Kafka, File source (json, csv, text, parquet) A unified system for end-to-end fault-tolerant, exactly-once stateful stream processing
  • 42. spark.readStream .format("kafka") .option("subscribe", "input") .load() .groupBy('value.cast("string") as 'key) .agg(count("*") as 'value) .writeStream .format("kafka") .option("topic", "output") .trigger("1 minute") .outputMode(OutputMode.Complete()) .option("checkpointLocation", "…") .start() Source • Specify one or more locations to read data from • Built-in support for Files/Kafka/Socket, pluggable. • Can include multiple sources of different types using union()
  • 43. spark.readStream .format("kafka") .option("subscribe", "input") .load() .groupBy('value.cast("string") as 'key) .agg(count("*") as 'value) .writeStream .format("kafka") .option("topic", "output") .trigger("1 minute") .outputMode(OutputMode.Complete()) .option("checkpointLocation", "…") .start() Transformation • Using DataFrames, Datasets and/or SQL. • Catalyst figures out how to execute the transformation incrementally. • Internal processing always exactly-once.
  • 44. DataFrames, Datasets, SQL input = spark.readStream .format("kafka") .option("subscribe", "topic") .load() result = input .select("device", "signal") .where("signal > 15") result.writeStream .format("parquet") .start("dest-path") Logical Plan: Read from Kafka → Project device, signal → Filter signal > 15 → Write to Parquet. Catalyst produces an Optimized Physical Plan (Kafka source, optimized operators with codegen, off-heap memory, Parquet sink) and runs it as a series of incremental execution plans, processing the new data at each trigger (t = 1, 2, 3, …).
  • 45. spark.readStream .format("kafka") .option("subscribe", "input") .load() .groupBy('value.cast("string") as 'key) .agg(count("*") as 'value) .writeStream .format("kafka") .option("topic", "output") .trigger("1 minute") .outputMode(OutputMode.Complete()) .option("checkpointLocation", "…") .start() Sink • Accepts the output of each batch. • When supported, sinks are transactional and exactly-once (e.g. files). • Use foreach to execute arbitrary code.
  • 46. spark.readStream .format("kafka") .option("subscribe", "input") .load() .groupBy('value.cast("string") as 'key) .agg(count("*") as 'value) .writeStream .format("kafka") .option("topic", "output") .trigger("1 minute") .outputMode("update") .option("checkpointLocation", "…") .start() Output mode – What's output • Complete – Output the whole answer every time • Update – Output changed rows • Append – Output new rows only Trigger – When to output • Specified as a time, eventually supports data size • No trigger means as fast as possible
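A sketch contrasting complete and update modes on a simple grouped count (console sink; the parsedData DataFrame, its device column and the trigger interval are assumptions; append applies to non-aggregated or watermarked queries):

    import org.apache.spark.sql.streaming.Trigger

    // Assumes a streaming DataFrame `parsedData` with a `device` column.
    val counts = parsedData.groupBy("device").count()

    // Complete: re-emit the whole result table at every trigger
    counts.writeStream
      .format("console")
      .outputMode("complete")
      .trigger(Trigger.ProcessingTime("1 minute"))
      .option("checkpointLocation", "/tmp/checkpoints/counts-complete")
      .start()

    // Update: emit only the rows whose counts changed; no trigger = fire as fast as possible
    counts.writeStream
      .format("console")
      .outputMode("update")
      .option("checkpointLocation", "/tmp/checkpoints/counts-update")
      .start()

    spark.streams.awaitAnyTermination()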
  • 47. spark.readStream .format("kafka") .option("subscribe", "input") .load() .groupBy('value.cast("string") as 'key) .agg(count("*") as 'value) .writeStream .format("kafka") .option("topic", "output") .trigger("1 minute") .outputMode("update") .option("checkpointLocation", "…") .start() Checkpoint • Tracks the progress of a query in persistent storage • Can be used to restart the query if there is a failure.
  • 48. Checkpointing tracks the progress (offsets) of consuming data from the source, plus intermediate state. Offsets and metadata are saved as JSON in a write-ahead log. You can resume after changing your streaming transformations, with end-to-end exactly-once guarantees.
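A sketch of resuming from a checkpoint (parsedData and the paths are assumptions): as long as a new run points at the same checkpointLocation, it picks up from the recorded offsets rather than reprocessing from scratch:

    // Assumes a streaming DataFrame `parsedData` with a `signal` column.
    val filtered = parsedData.where("signal > 15")

    // First run: source offsets are recorded in the write-ahead log under checkpointLocation
    val q1 = filtered.writeStream
      .format("parquet")
      .option("path", "/data/strongSignals")
      .option("checkpointLocation", "/checkpoints/strongSignals")
      .start()
    q1.stop()   // stands in for a crash, upgrade or planned restart

    // Restart (possibly with changed transformations): the same checkpointLocation
    // resumes from the recorded offsets with exactly-once output
    val q2 = filtered.writeStream
      .format("parquet")
      .option("path", "/data/strongSignals")
      .option("checkpointLocation", "/checkpoints/strongSignals")
      .start()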
  • 49. Specify options to configure How? kafka.bootstrap.servers => broker1,broker2 What? subscribe => topic1,topic2,topic3 // fixed list of topics subscribePattern => topic* // dynamic list of topics assign => {"topicA":[0,1]} // specific partitions Where? startingOffsets => latest (default) / earliest / {"topicA":{"0":23,"1":345}} val rawData = spark.readStream .format("kafka") .option("kafka.bootstrap.servers", ...) .option("subscribe", "topic") .load()
  • 50. The rawData dataframe has the following columns: key [binary], value [binary], topic (e.g. "topicA"), partition (e.g. 0), offset (e.g. 345), timestamp (e.g. 1486087873) val rawData = spark.readStream .format("kafka") .option("kafka.bootstrap.servers", ...) .option("subscribe", "topic") .load()
  • 51. Cast binary value to string, name it column json val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json($"json", schema).as("data")) .select("data.*")
  • 52. Cast binary value to string, name it column json; parse the json string and expand it into nested columns, name it data val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json($"json", schema).as("data")) .select("data.*") from_json turns a json column of strings like { "timestamp": 1486087873, "device": "devA", …} into a nested data column with fields timestamp, device, …
  • 53. Transforming Data Cast binary value to string, name it column json; parse the json string and expand it into nested columns, name it data; flatten the nested columns val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json($"json", schema).as("data")) .select("data.*") select("data.*") flattens the nested data struct into top-level columns (timestamp, device, …)
  • 54. Transforming Data Cast binary value to string, name it column json; parse the json string and expand it into nested columns, name it data; flatten the nested columns val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json($"json", schema).as("data")) .select("data.*") Powerful built-in APIs to perform complex data transformations: from_json, to_json, explode, ... 100s of functions (see our blog post)
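Putting the three steps together as one runnable sketch (the broker address, topic name and the event schema are assumptions):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types._

    object ParseKafkaJson {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("parse-kafka-json").getOrCreate()
        import spark.implicits._

        // Hypothetical schema of the JSON payload
        val schema = new StructType()
          .add("timestamp", TimestampType)
          .add("device", StringType)
          .add("signal", IntegerType)

        val rawData = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "topic")
          .load()

        val parsedData = rawData
          .selectExpr("CAST(value AS STRING) AS json")     // binary -> string
          .select(from_json($"json", schema).as("data"))   // string -> nested struct
          .select("data.*")                                // struct -> flat columns

        parsedData.writeStream
          .format("console")
          .option("checkpointLocation", "/tmp/checkpoints/parse-kafka-json")
          .start()
          .awaitTermination()
      }
    }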
  • 55. Writing to Parquet Save the parsed data as a Parquet table at the given path. Partition files by date so that future queries on time slices of the data are fast, e.g. a query on the last 48 hours of data val query = parsedData.writeStream .option("checkpointLocation", ...) .partitionBy("date") .format("parquet") .start("/parquetTable")
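The date column used for partitioning has to exist in the data; a sketch of deriving it from the event timestamp before writing (parsedData as above, paths are placeholders):

    import org.apache.spark.sql.functions.{col, to_date}

    // Assumes `parsedData` has a `timestamp` column (see the parsing step above).
    val withDate = parsedData.withColumn("date", to_date(col("timestamp")))

    val query = withDate.writeStream
      .option("checkpointLocation", "/checkpoints/parquetTable")
      .partitionBy("date")                 // one directory per day => fast time-slice queries
      .format("parquet")
      .start("/parquetTable")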
  • 56. Checkpointing Enable checkpointing by setting the checkpoint location to save offset logs start actually starts a continuously running StreamingQuery in the Spark cluster val query = parsedData.writeStream .option("checkpointLocation", ...) .format("parquet") .partitionBy("date") .start("/parquetTable/")
  • 57. Streaming Query query is a handle to the continuously running StreamingQuery Used to monitor and manage the execution val query = parsedData.writeStream .option("checkpointLocation", ...) .format("parquet") .partitionBy("date") .start("/parquetTable/") The StreamingQuery processes the new data at each trigger (t = 1, 2, 3, …)
  • 58. Data Consistency on Ad-hoc Queries Data is available for complex, ad-hoc analytics within seconds. The Parquet table is updated atomically, ensuring prefix integrity: even if distributed, ad-hoc queries will see either all updates from the streaming query or none. Read more in our blog: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
  • 59. More Kafka Support [Spark 2.2] Write out to Kafka: the DataFrame must have a string or binary value field (and optionally key). Direct, interactive and batch queries on Kafka make Kafka even more powerful as a storage platform! result.writeStream .format("kafka") .option("topic", "output") .start() val df = spark .read // not readStream .format("kafka") .option("subscribe", "topic") .load() df.createOrReplaceTempView("topicData") spark.sql("select value from topicData")
  • 61. Event Time Many use cases require aggregate statistics by event time, e.g. what is the number of errors in each system in 1-hour windows? Many challenges: extracting event time from the data, and handling late, out-of-order data. The DStream APIs were insufficient for event-time processing.
  • 62. Event-time Aggregations Windowing is just another type of grouping in Structured Streaming. Number of records every hour: parsedData .groupBy(window($"timestamp", "1 hour")) .count() Average signal strength of each device every 10 mins: parsedData .groupBy($"device", window($"timestamp", "10 mins")) .avg("signal") UDAFs are supported too!
  • 63. Stateful Processing for Aggregations Aggregates have to be saved as distributed state between triggers. Each trigger reads the previous state and writes updated state. State is stored in memory, backed by a write-ahead log in HDFS/S3. Fault-tolerant, exactly-once guarantee! At each trigger (t = 1, 2, 3, …) the new data is processed and state updates are written to the log for checkpointing.
  • 64. Automatically Handles Late Data [figure: hourly window counts being revised across triggers as late records arrive; red cells mark state updated with late data] Keeping state allows late data to update the counts of old windows. But the size of the state increases indefinitely if old windows are not dropped.
  • 65. Watermarking Watermark: a moving threshold of how late data is expected to be, and when to drop old state. It trails behind the max seen event time, and the trailing gap is configurable (e.g. with a trailing gap of 10 mins and max event time 12:30 PM, the watermark is 12:20 PM; data older than the watermark is not expected).
  • 66. Watermarking Data newer than the watermark may be late, but is allowed to aggregate. Data older than the watermark is "too late" and dropped. Windows older than the watermark are automatically deleted to limit the amount of intermediate state.
  • 67. Watermarking parsedData .withWatermark("timestamp", "10 minutes") .groupBy(window($"timestamp", "5 minutes")) .count() Allowed lateness of 10 mins: late data within the watermark is allowed to aggregate, data older than the watermark is dropped. Useful only in stateful operations (streaming aggregations, dropDuplicates, mapGroupsWithState, ...); ignored in non-stateful streaming queries and batch queries.
  • 68. Watermarking parsedData .withWatermark("timestamp", "10 minutes") .groupBy(window($"timestamp", "5 minutes")) .count() [figure: event time vs. processing time] The system tracks the max observed event time (e.g. 12:14); the watermark for the next trigger is updated to 12:14 - 10 min = 12:04, so state for windows before 12:04 is deleted. Data that arrives late but is newer than the watermark is still considered in the counts; data older than the watermark is ignored and its state dropped. More details in my blog post.
  • 72. Clean separation of concerns parsedData .withWatermark("timestamp", "10 minutes") .groupBy(window($"timestamp", "5 minutes")) .count() .writeStream .trigger("10 seconds") .start() Query semantics (same for batch & streaming): how to group data by time? Processing details: how late can data be? How often to emit updates?
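Pulling the pieces together, a sketch of the complete query using the published trigger API (Trigger.ProcessingTime) instead of the slide's shorthand; parsedData, the console sink and the checkpoint path are assumptions:

    import org.apache.spark.sql.functions.{col, window}
    import org.apache.spark.sql.streaming.Trigger

    // Assumes `parsedData` has an event-time column named `timestamp`.
    val windowedCounts = parsedData
      .withWatermark("timestamp", "10 minutes")             // how late data may be
      .groupBy(window(col("timestamp"), "5 minutes"))       // how to group by time
      .count()

    val query = windowedCounts.writeStream
      .outputMode("update")                                 // emit only changed windows
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds"))        // how often to emit updates
      .option("checkpointLocation", "/checkpoints/windowedCounts")
      .start()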
  • 73. Arbitrary Stateful Operations [Spark 2.2] mapGroupsWithState applies a user-defined stateful function to user-defined state, per key. Direct support for per-key timeouts in event time or processing time. Supports Scala and Java. ds.groupByKey(_.id) .mapGroupsWithState(timeoutConf)(mappingWithStateFunc) def mappingWithStateFunc( key: K, values: Iterator[V], state: GroupState[S]): U = { // update or remove state // set timeouts // return mapped value }
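A self-contained sketch of mapGroupsWithState that keeps a running count per key and expires idle keys with a processing-time timeout (the socket source, the Event fields and the 30-minute timeout are assumptions):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

    case class Event(id: String, value: Long)
    case class RunningCount(count: Long)

    object StatefulCounts {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("stateful-counts").getOrCreate()
        import spark.implicits._

        // Toy input: lines of "id,value" from a socket (placeholder source)
        val events = spark.readStream
          .format("socket").option("host", "localhost").option("port", 9999)
          .load()
          .as[String]
          .map { line => val parts = line.split(","); Event(parts(0), parts(1).toLong) }

        // User-defined stateful function: update a per-key count, expire idle keys
        def updateCount(id: String, values: Iterator[Event],
                        state: GroupState[RunningCount]): (String, Long) = {
          if (state.hasTimedOut) {                 // no data for this key recently
            val last = state.get.count
            state.remove()                         // drop its state
            (id, last)
          } else {
            val previous = state.getOption.map(_.count).getOrElse(0L)
            val updated = RunningCount(previous + values.size)
            state.update(updated)
            state.setTimeoutDuration("30 minutes") // processing-time timeout
            (id, updated.count)
          }
        }

        val counts = events
          .groupByKey(_.id)
          .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateCount)

        counts.writeStream
          .format("console")
          .outputMode("update")
          .option("checkpointLocation", "/tmp/checkpoints/stateful-counts")
          .start()
          .awaitTermination()
      }
    }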
  • 74. Other interesting operations Streaming deduplication (use watermarks to limit state): parsedData.dropDuplicates("eventId") Stream-batch joins: val batchData = spark.read .format("parquet") .load("/additional-data") parsedData.join(batchData, "device") Stream-stream joins: can use mapGroupsWithState; direct support coming soon!
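A sketch of the first two operations on a parsed stream (the eventId, timestamp and device columns and the static dataset path are assumptions):

    // Streaming deduplication: the watermark bounds how long per-event state is kept
    // (including the event-time column among the dedup keys lets old state be dropped)
    val deduped = parsedData
      .withWatermark("timestamp", "1 hour")
      .dropDuplicates("eventId", "timestamp")

    // Stream-batch join: enrich each streaming record with static device metadata
    val batchData = spark.read.format("parquet").load("/additional-data")
    val enriched = deduped.join(batchData, "device")
    // `enriched` is still a streaming DataFrame and can be written out with writeStream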