Spark Streaming
The State of the Union and the Road Beyond
Tathagata “TD” Das
@tathadas
March 18, 2015
Who am I?
Project Management Committee (PMC) member of Spark
Lead developer of Spark Streaming
Formerly in AMPLab, UC Berkeley
Software developer at Databricks
What is Spark Streaming?
Spark Streaming
Scalable, fault-tolerant stream processing system
[Diagram: sources (Kafka, Flume, Kinesis, HDFS/S3, Twitter) → Spark Streaming → sinks (file systems, databases, dashboards)]
High-level API
joins, windows, …
often 5x less code
Fault-tolerant
Exactly-once semantics, even for stateful ops
Integration
Integrates with MLlib, SQL, DataFrames, GraphX
How does it work?
Receivers receive data streams and chop them up into batches
Spark processes the batches and pushes out the results
[Diagram: data streams → receivers → batches → results]
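The batching step described above can be illustrated with a minimal plain-Python sketch. It does not use Spark; the function names `chop_into_batches` and `process`, the sample events, and the 1-second batch interval are all assumptions made for illustration.

```python
from collections import defaultdict


def chop_into_batches(events, batch_interval):
    """Group (timestamp, record) pairs into batches of `batch_interval` seconds."""
    batches = defaultdict(list)
    for timestamp, record in events:
        # Records whose timestamps fall in the same interval share a batch.
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(record)
    # Return non-empty batches in time order.
    return [batches[i] for i in sorted(batches)]


def process(batch):
    """Stand-in for the per-batch job: here it only counts records."""
    return len(batch)


# Hypothetical input: (arrival time in seconds, record)
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
results = [process(batch) for batch in chop_into_batches(events, batch_interval=1.0)]
print(results)
```

Each batch is processed as an ordinary finite data set, which is why the batch and streaming APIs can look so similar.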
Streaming Word Count with Kafka
// create DStream with lines from Kafka
val kafka = KafkaUtils.create(ssc, kafkaParams, …)

// split lines into words
val words = kafka.map(_._2).flatMap(_.split(" "))

// count the words
val wordCounts = words.map(x => (x, 1))
                      .reduceByKey(_ + _)

// print some counts on screen
wordCounts.print()

// start processing the stream
ssc.start()
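To make the data flow concrete, here is a rough plain-Python sketch of what the flatMap, map, and reduceByKey steps compute for a single batch. It is not Spark code; `word_count` and the sample lines are hypothetical.

```python
from collections import Counter


def word_count(batch_lines):
    """Compute word counts for the lines of one batch."""
    # flatMap: split every line into words and flatten the result
    words = [word for line in batch_lines for word in line.split(" ")]
    # map: pair every word with the count 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: sum the counts of identical words
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


# Two hypothetical batches of lines
batches = [["to be or", "not to be"], ["be quick"]]
for lines in batches:
    print(word_count(lines))
```

Note that the counts are per batch: without a stateful or windowed operation, nothing is carried over from one batch to the next.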
Languages
Can natively use Scala, Java, and Python
Can use any other language through RDD.pipe()
Integrates with Spark Ecosystem
[Diagram: Spark Streaming, Spark SQL, MLlib, and GraphX built on Spark Core]
Combine batch and streaming processing
Join data streams with static data sets
// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")

// Join each batch in stream with the dataset
kafkaStream.transform { batchRDD =>
  batchRDD.join(dataset).filter(...)
}
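The per-batch join can be sketched in plain Python as follows. This is only an illustration of the idea, not Spark code: the static data set is modelled as a dictionary with unique keys, and the user tiers, events, and filter condition are invented for the example.

```python
def join(batch, dataset):
    """Inner join of (key, value) pairs against a static key -> value mapping."""
    return [(key, (value, dataset[key])) for key, value in batch if key in dataset]


# Hypothetical static data set, loaded once
static_dataset = {"u1": "gold", "u2": "free"}

# Hypothetical stream, already chopped into batches of (user, action) pairs
batches = [
    [("u1", "click"), ("u3", "view")],
    [("u2", "click"), ("u1", "view")],
]


def transform(batch):
    """Applied to every batch: join with the static data, then filter."""
    joined = join(batch, static_dataset)
    # Hypothetical filter: keep only events from gold-tier users
    return [row for row in joined if row[1][1] == "gold"]


joined_batches = [transform(batch) for batch in batches]
print(joined_batches)
```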
  
Combine machine learning with streaming
Learn models offline, apply them online
// Learn model offline
val model = KMeans.train(dataset, ...)

// Apply model online on stream
kafkaStream.map { event =>
  model.predict(event.feature)
}
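A small plain-Python sketch of the same pattern: train once on historical data, then score each incoming event. The one-dimensional k-means below is a deliberately naive stand-in for MLlib's KMeans (it seeds the centers with the first k points), and the data values are made up.

```python
def nearest(centroids, x):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda i: abs(centroids[i] - x))


def train_kmeans(points, k, iterations=10):
    """Naive 1-D k-means, seeded with the first k points (an assumption)."""
    centroids = list(points[:k])
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in points:
            clusters[nearest(centroids, x)].append(x)
        # Recompute each center; keep the old one if its cluster is empty.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


# Learn model offline on a hypothetical historical data set
model = train_kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], k=2)

# Apply model online: score each event of an incoming batch
batch = [{"feature": 1.1}, {"feature": 8.0}]
predictions = [nearest(model, event["feature"]) for event in batch]
print(predictions)
```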
  
	
  
Combine SQL with streaming
Interactively query streaming data with SQL
// Register each batch in stream as table
// (foreachRDD hands over each batch as an RDD; map would only see single records)
kafkaStream.foreachRDD { batchRDD =>
  batchRDD.registerTempTable("latestEvents")
}

// Interactively query table
sqlContext.sql("select * from latestEvents")
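The "latest batch as a table" idea can be tried without Spark using Python's built-in sqlite3 module. This is only an analogy: `register_batch`, the column names, and the sample rows are hypothetical, and an in-memory SQLite table stands in for Spark SQL's temporary table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def register_batch(conn, rows):
    """Replace the contents of latestEvents with the rows of the newest batch."""
    conn.execute("DROP TABLE IF EXISTS latestEvents")
    conn.execute("CREATE TABLE latestEvents (user_id TEXT, action TEXT)")
    conn.executemany("INSERT INTO latestEvents VALUES (?, ?)", rows)


# Hypothetical stream, already chopped into batches
batches = [
    [("u1", "click"), ("u2", "view")],
    [("u3", "click")],
]
for rows in batches:
    register_batch(conn, rows)

# An interactive query always sees the most recent batch only
result = conn.execute("select * from latestEvents").fetchall()
print(result)
```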
  
A Brief History
Late 2011 – research idea
AMPLab, UC Berkeley
"We need to make Spark faster"
"Okay... umm, how??!?!"
Q2 2012 – prototype
Rewrote large parts of Spark core
Smallest job: 900 ms → <50 ms
Q3 2012
Spark core improvements
open sourced in Spark 0.6
Feb 2013 – Alpha release
7.7k lines, merged in 7 days
Released with Spark 0.7
Jan 2014 – Stable release
Graduation with Spark 0.9
Current state of Spark Streaming
Adoption
Roadmap
Development
What have we added in the last year?
Python API
Core functionality in Spark 1.2, with sockets and files as sources
Kafka support in Spark 1.3
Other sources coming in future
kafka = KafkaUtils.createStream(ssc, params, …)
lines = kafka.map(lambda x: x[1])
# parentheses let the chained calls span several lines
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()
  
Streaming MLlib algorithms
val model = new StreamingKMeans()
  .setK(10)
  .setDecayFactor(1.0)
  .setRandomCenters(4, 0.0)

// Apply model to DStreams
model.trainOn(trainingDStream)
model.predictOnValues(
  testDStream.map { lp =>
    (lp.label, lp.features)
  }
).print()
Continuous learning and prediction on streaming data
StreamingLinearRegression in Spark 1.1
StreamingKMeans in Spark 1.2
StreamingLogisticRegression in Spark 1.3
https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
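As a rough illustration of what the decay factor does, the sketch below updates a single one-dimensional cluster center from one batch. It follows the update rule as I read it from the linked blog post (old center weighted by its discounted point count, new batch weighted by its size); treat the exact rule, the function name `update_center`, and the numbers as assumptions rather than MLlib's implementation.

```python
def update_center(center, weight, batch_points, decay_factor):
    """Blend an existing center with the mean of a new batch of points.

    center       -- current center (1-D for simplicity)
    weight       -- number of points that have contributed to it so far
    decay_factor -- 1.0 keeps all history, 0.0 forgets it entirely
    """
    m = len(batch_points)
    if m == 0:
        return center, weight
    batch_mean = sum(batch_points) / m
    discounted = weight * decay_factor
    new_center = (center * discounted + batch_mean * m) / (discounted + m)
    return new_center, weight + m


# decay factor 1.0: old and new points count equally
print(update_center(0.0, 2, [3.0, 3.0], decay_factor=1.0))
# decay factor 0.0: only the newest batch matters
print(update_center(0.0, 2, [3.0, 3.0], decay_factor=0.0))
```

With `setDecayFactor(1.0)`, as on the slide, the center behaves like a running mean over everything seen so far.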
Kafka `Direct` Stream API
Earlier Receiver-based approach for Kafka
Requires replicated journals (write ahead logs) to ensure
zero data loss under driver failures
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
Kafka Receiver: high-level consumer
New direct approach for Kafka in Spark 1.3
simple consumer API to read Kafka topics
New direct approach for Kafka in 1.3 – treat Kafka like a file system
No receivers!!!
Directly query Kafka for latest topic offsets, and read data like reading files
Spark Streaming, rather than ZooKeeper, keeps track of Kafka offsets
More efficient, fault-tolerant, exactly-once receiving of Kafka data
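The offset-tracking idea can be sketched in plain Python. A list stands in for one Kafka topic partition, and the hypothetical `DirectReader` class decides, per batch, which offset range to read. Because a batch is defined by its offset range, re-reading it after a failure returns the same records.

```python
class DirectReader:
    """Toy model of the direct approach for a single partition (not Spark code)."""

    def __init__(self, log):
        self.log = log        # stand-in for a Kafka topic partition
        self.next_offset = 0  # tracked by the reader itself

    def plan_batch(self):
        """Query the latest offset and fix the range [from, until) for this batch."""
        from_offset, until_offset = self.next_offset, len(self.log)
        self.next_offset = until_offset
        return from_offset, until_offset

    def read(self, offset_range):
        """Read a planned range, like reading a byte range of a file."""
        start, end = offset_range
        return self.log[start:end]


log = ["a", "b", "c"]
reader = DirectReader(log)

range1 = reader.plan_batch()   # covers a, b, c
log.extend(["d", "e"])         # new records arrive
range2 = reader.plan_batch()   # covers d, e

print(reader.read(range1), reader.read(range2))
# Recomputing a batch after a failure reads exactly the same records
print(reader.read(range1) == reader.read(range1))
```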
Other Library Additions
Amazon Kinesis integration [Spark 1.1]
More fault-tolerant Flume integration [Spark 1.1]
System Infrastructure
Automated driver fault-tolerance [Spark 1.0]
Graceful shutdown [Spark 1.0]
Write Ahead Logs for zero data loss [Spark 1.2]
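The write-ahead-log idea, in a minimal plain-Python sketch: every received record is written to durable storage before it is acknowledged, so it can be replayed after a crash. The class and function names are hypothetical, and a local temporary file stands in for the fault-tolerant storage a real deployment would use.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Append-only log of received records, one JSON document per line."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the record to disk before returning

    def replay(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]


def receive(wal, buffer, record):
    """Log first, then buffer in memory, then acknowledge."""
    wal.append(record)
    buffer.append(record)
    return "ack"


wal_path = os.path.join(tempfile.mkdtemp(), "received.log")
wal = WriteAheadLog(wal_path)
memory = []
for record in ["e1", "e2", "e3"]:
    receive(wal, memory, record)

memory = []  # simulate a crash that wipes the in-memory buffer
recovered = WriteAheadLog(wal_path).replay()
print(recovered)
```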
Contributors to Streaming
[Bar chart: contributors to Spark Streaming per release, Spark 0.9 to Spark 1.2; y-axis 0 to 40]
Contributors - Full Picture
[Bar chart: contributors per release, Spark 0.9 to Spark 1.2, for Streaming alone and for Core + Streaming (w/o SQL, MLlib, …); y-axis 0 to 120]
All contributions to core Spark directly improve Spark Streaming
Spark Packages
More contributions from the community in spark-packages
Alternate Kafka receiver
Apache Camel receiver
Cassandra examples
http://spark-packages.org/
Who is using Spark Streaming?
Spark Summit 2014 Survey
40% of Spark users were using Spark Streaming in production or prototyping
Another 39% were evaluating it
[Pie chart: Not using 21%, Evaluating 39%, Prototyping 31%, Production 9%]
80+ known deployments
Intel China builds big data solutions for large enterprises
Multiple streaming applications for top businesses
Real-time risk analysis for a top online payment company
Real-time deal and flow metric reporting for a top online shopping company
Complicated stream processing
SQL queries on streams
Join streams with large historical datasets
> 1TB/day passing through Spark Streaming
Stack: YARN, Spark Streaming, Kafka, RocketMQ, HBase
One of the largest publishing and education companies, wants to accelerate its push into digital learning
Needed to combine student activities and domain events to
continuously update the learning model of each student
Earlier implementation in Storm, but now moved on to Spark Streaming
Chose Spark Streaming because Spark combines batch, streaming, machine learning, and graph processing
Stack: Spark Standalone, Spark Streaming, Kafka, Cassandra, Apache Blur
More information: http://dbricks.co/1BnFZZ8
Leading advertising automation company with an exchange
platform for in-feed ads
Process clickstream data for optimizing real-time bidding for ads
Stack: Mesos + Marathon, Spark Streaming, Kinesis, MySQL, Redis, RabbitMQ, SQS
Wants to learn trending movies and shows in real time
Currently in the middle of replacing one of their internal stream processing architectures with Spark Streaming
Tested resiliency of Spark Streaming with Chaos Monkey
More information: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
Driver failures handled with Spark Standalone cluster's supervise mode
Worker, executor and receiver failures automatically handled
Spark Streaming can handle all kinds of failures
Neuroscience @ Freeman Lab, Janelia Farm
Spark Streaming and MLlib to analyze neural activities
Laser microscope scans zebrafish brain → Spark Streaming → interactive visualization → laser ZAP to kill neurons!
http://www.jeremyfreeman.net/share/talks/spark-summit-2014/
Streaming machine learning algorithms on time series data of every neuron
Up to 2TB/hour and increasing with brain size
Up to 80 HPC nodes
Why are they adopting Spark Streaming?
Easy, high-level API
Unified API across batch and streaming
Integration with Spark SQL and MLlib
Ease of operations
What’s coming next?
Libraries
Operational Ease
Performance
Roadmap
Libraries
Streaming machine learning algorithms
A/B testing
Online Latent Dirichlet Allocation (LDA)
More streaming linear algorithms
Streaming + DataFrames, Streaming + SQL
Roadmap
Operational Ease
Better flow control
Elastic scaling
Cross-version upgradability
Improved support for non-Hadoop environments
Roadmap
Performance
Higher throughput, especially of stateful operations
Lower latencies
Easy deployment of streaming apps in Databricks Cloud!
You can help!
Roadmaps are heavily driven by community feedback
We have listened to community demands over the last year
Write Ahead Logs for zero data loss
New Kafka direct API
Let us know what you want to see in Spark Streaming
Spark user mailing list, or tweet it to me @tathadas
Industry adoption increasing rapidly
Community contributing very actively
More libraries, operational ease and performance in the roadmap
@tathadas
Backup slides
Typesafe survey of Spark users
2136 developers, data scientists, and other tech professionals
http://java.dzone.com/articles/apache-spark-survey-typesafe-0
65% of Spark users are interested in Spark Streaming
2/3 of Spark users want to process event streams
More use cases
•  Big data solution provider for enterprises
•  Multiple applications for different businesses
-  Monitoring + optimizing online services of Tier-1 bank
-  Fraudulent transaction detection for Tier-2 bank
•  Kafka → SS → Cassandra, MongoDB
•  Built their own Stratio Streaming platform on Spark Streaming, Kafka, Cassandra, MongoDB
•  Provides data analytics solutions for Communication Service Providers
-  4 of 5 top mobile ops, 3 of 4 top internet backbone providers
-  Processes >50% of all US mobile traffic
•  Multiple applications for different businesses
-  Real-time anomaly detection in cell tower traffic
-  Real-time call quality optimizations
•  Kafka → SS
http://spark-summit.org/2014/talk/building-big-data-operational-intelligence-platform-with-apache-spark
•  Runs claims processing applications for healthcare providers
http://searchbusinessanalytics.techtarget.com/feature/Spark-Streaming-project-looks-to-shed-new-light-on-medical-claims
•  Predictive models can look for claims that are likely to be held up for approval
•  Spark Streaming allows model scoring in seconds instead of hours

Contenu connexe

Tendances

Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and RDatabricks
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Lessons from Running Large Scale Spark Workloads
Lessons from Running Large Scale Spark WorkloadsLessons from Running Large Scale Spark Workloads
Lessons from Running Large Scale Spark WorkloadsDatabricks
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and RDatabricks
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkReynold Xin
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Databricks
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonDatabricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesDatabricks
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsMiklos Christine
 
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...Databricks
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityDatabricks
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsDatabricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Databricks
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellDatabricks
 

Tendances (20)

Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Lessons from Running Large Scale Spark Workloads
Lessons from Running Large Scale Spark WorkloadsLessons from Running Large Scale Spark Workloads
Lessons from Running Large Scale Spark Workloads
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and R
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache Spark
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science London
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure...
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
 

En vedette

Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17spark-project
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Carol McDonald
 
Getting Started with HBase
Getting Started with HBaseGetting Started with HBase
Getting Started with HBaseCarol McDonald
 
Java 9 and the impact on Maven Projects (ApacheCon Europe 2016)
Java 9 and the impact on Maven Projects (ApacheCon Europe 2016)Java 9 and the impact on Maven Projects (ApacheCon Europe 2016)
Java 9 and the impact on Maven Projects (ApacheCon Europe 2016)Robert Scholte
 
Spark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in JapanSpark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in JapanTaro L. Saito
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZDataFactZ
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBaseCarol McDonald
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBaseCarol McDonald
 
Scala overview
Scala overviewScala overview
Scala overviewSteve Min
 
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...Spark Summit
 
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...DataWorks Summit/Hadoop Summit
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataJetlore
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An OverviewMohit Jain
 

En vedette (20)

Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
VS_TextByTheBay
VS_TextByTheBayVS_TextByTheBay
VS_TextByTheBay
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
 
Apache spark core
Apache spark coreApache spark core
Apache spark core
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Getting Started with HBase
Getting Started with HBaseGetting Started with HBase
Getting Started with HBase
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Java 9 and the impact on Maven Projects (ApacheCon Europe 2016)
Java 9 and the impact on Maven Projects (ApacheCon Europe 2016)Java 9 and the impact on Maven Projects (ApacheCon Europe 2016)
Java 9 and the impact on Maven Projects (ApacheCon Europe 2016)
 
Spark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in JapanSpark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in Japan
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBase
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
Scala overview
Scala overviewScala overview
Scala overview
 
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
 
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An Overview
 

Similaire à Spark streaming state of the union

Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingDatabricks
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterPaolo Castagna
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Spark streaming state of the union

  • 1. Spark Streaming: The State of the Union and the Road Beyond. Tathagata “TD” Das (@tathadas), March 18, 2015.
  • 2. Who am I? Project Management Committee (PMC) member of Spark; lead developer of Spark Streaming; formerly in AMPLab, UC Berkeley; software developer at Databricks.
  • 4. Spark Streaming: a scalable, fault-tolerant stream processing system. Sources: Kafka, Flume, Kinesis, HDFS/S3, Twitter. Sinks: file systems, databases, dashboards. High-level API: joins, windows, …; often 5x less code. Fault-tolerant: exactly-once semantics, even for stateful operations. Integration: works with MLlib, SQL, DataFrames, GraphX.
  • 5. How does it work? Receivers receive data streams and chop them up into batches. Spark processes the batches and pushes out the results. (Diagram: data streams → receivers → batches → results.)
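The batching step on slide 5 can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark's implementation: the helper names `chop_into_batches` and `process`, the `(timestamp, value)` record shape, and the one-second interval are all assumptions made for illustration.

```python
from collections import defaultdict


def chop_into_batches(records, batch_interval):
    """Group (timestamp, value) records into fixed-length batches.

    Hypothetical helper: a record with timestamp t lands in batch
    number int(t // batch_interval).
    """
    batches = defaultdict(list)
    for timestamp, value in records:
        batches[int(timestamp // batch_interval)].append(value)
    # Return the batches in time order as (batch_index, values) pairs.
    return sorted(batches.items())


def process(batches, fn):
    """Apply the same batch function to every batch and collect the results."""
    return [(index, fn(values)) for index, values in batches]


# A tiny stand-in for a data stream: (arrival time in seconds, payload).
stream = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]

batches = chop_into_batches(stream, batch_interval=1.0)
results = process(batches, len)  # here the "job" just counts records per batch
```

Each batch is handled by an ordinary batch function, which is the property that lets a streaming engine built this way reuse batch-processing code.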
  • 6. Streaming Word Count with Kafka
        val kafka = KafkaUtils.create(ssc, kafkaParams, …)          // create DStream with lines from Kafka
        val words = kafka.map(_._2).flatMap(_.split(" "))           // split lines into words
        val wordCounts = words.map(x => (x, 1))
                              .reduceByKey(_ + _)                   // count the words
        wordCounts.print()                                          // print some counts on screen
        ssc.start()                                                 // start processing the stream
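To make the three steps of the word count concrete, here is a plain-Python sketch of what happens to a single batch of lines. It is not the Spark API; `word_count` and the sample batch are hypothetical, and the comments only point at the operator each step stands in for.

```python
from collections import Counter


def word_count(lines):
    """Count the words in one batch of lines."""
    # Stand-in for flatMap: every line yields several words.
    words = [word for line in lines for word in line.split(" ")]
    # Stand-in for map: pair every word with the count 1.
    pairs = [(word, 1) for word in words]
    # Stand-in for reduceByKey(_ + _): sum the counts per word.
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


batch = ["to be or", "not to be"]
counts = word_count(batch)
```

In a streaming job the same function would run once per batch, producing one set of counts per batch interval.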
  • 7. Languages: natively supported languages (shown as logos on the slide); any other language can be used through RDD.pipe().
  • 8. Integrates with the Spark ecosystem: Spark Streaming, Spark SQL, MLlib, and GraphX all run on Spark Core.
  • 9. Combine batch and streaming processing: join data streams with static data sets.
        // Create data set from Hadoop file
        val dataset = sparkContext.hadoopFile("file")
        // Join each batch in stream with the dataset
        kafkaStream.transform { batchRDD =>
          batchRDD.join(dataset).filter(...)
        }
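The per-batch join on slide 9 can be sketched in plain Python. This is an illustration under assumptions, not Spark code: the static data set is modelled as a dict with unique keys, and `join`, `transform`, `user_country`, and the sample events are hypothetical.

```python
def join(batch, dataset):
    """Inner join of a batch of (key, value) pairs with a static dict."""
    return [(key, (value, dataset[key])) for key, value in batch if key in dataset]


def transform(stream_batches, fn):
    """Apply an arbitrary batch-to-batch function to every batch of the stream."""
    return [fn(batch) for batch in stream_batches]


# Static data set, loaded once (hypothetical user -> country lookup).
user_country = {"u1": "FR", "u2": "US"}

# Two batches of (user, action) events.
batches = [
    [("u1", "click"), ("u3", "view")],
    [("u2", "click"), ("u1", "view")],
]

# Join each batch with the static data, then keep only the clicks.
joined = transform(
    batches,
    lambda batch: [row for row in join(batch, user_country) if row[1][0] == "click"],
)
```

The static data is read once and reused for every batch; only the streaming side changes from one interval to the next.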
  • 10. Combine machine learning with streaming: learn models offline, apply them online.
        // Learn model offline
        val model = KMeans.train(dataset, ...)
        // Apply model online on stream
        kafkaStream.map { event =>
          model.predict(event.feature)
        }
  • 11. Combine SQL with streaming: interactively query streaming data with SQL.
        // Register each batch in stream as table
        kafkaStream.foreachRDD { batchRDD =>
          batchRDD.registerTempTable("latestEvents")
        }
        // Interactively query table
        sqlContext.sql("select * from latestEvents")
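The idea on slide 11, re-registering the newest batch under a fixed table name so that ad hoc queries always see the latest data, can be tried out with Python's built-in `sqlite3` module. This is only a stand-in for Spark SQL; the column names and the `register_latest` helper are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def register_latest(conn, batch):
    """Replace the latestEvents table with the rows of the newest batch."""
    conn.execute("DROP TABLE IF EXISTS latestEvents")
    conn.execute("CREATE TABLE latestEvents (user_id TEXT, action TEXT)")
    conn.executemany("INSERT INTO latestEvents VALUES (?, ?)", batch)


batches = [
    [("u1", "click"), ("u2", "view")],
    [("u3", "click")],
]

# Each arriving batch overwrites the table registered by the previous one.
for batch in batches:
    register_latest(conn, batch)

# An interactive query sees only the most recent batch.
rows = conn.execute("select * from latestEvents").fetchall()
```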
  • 12. A Brief History. Late 2011: research idea at AMPLab, UC Berkeley. “We need to make Spark faster.” “Okay...umm, how??!?!”
  • 13-14. A Brief History
        Late 2011: idea (AMPLab, UC Berkeley)
        Q2 2012: prototype. Rewrote large parts of Spark core; smallest job went from 900 ms to <50 ms
        Q3 2012: Spark core improvements open sourced in Spark 0.6
        Feb 2013: alpha release. 7.7k lines, merged in 7 days; released with Spark 0.7
        Jan 2014: stable release. Graduation with Spark 0.9
  • 17. What have we added in the last year?
  • 18. Python API: core functionality in Spark 1.2, with sockets and files as sources. Kafka support in Spark 1.3. Other sources coming in the future.
        kafka = KafkaUtils.createStream(ssc, params, …)
        lines = kafka.map(lambda x: x[1])
        counts = lines.flatMap(lambda line: line.split(" ")) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda a, b: a + b)
        counts.pprint()
  • 19. Streaming MLlib algorithms: continuous learning and prediction on streaming data. StreamingLinearRegression in Spark 1.1, StreamingKMeans in Spark 1.2, StreamingLogisticRegression in Spark 1.3.
        val model = new StreamingKMeans()
          .setK(10)
          .setDecayFactor(1.0)
          .setRandomCenters(4, 0.0)
        // Apply model to DStreams
        model.trainOn(trainingDStream)
        model.predictOnValues(
          testDStream.map { lp =>
            (lp.label, lp.features)
          }
        ).print()
    https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
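The decay factor set on slide 19 controls how strongly past data is weighted when a cluster center is updated from a new batch. The sketch below shows one common form of such a "forgetful" update for a single center, in plain Python. Whether it matches MLlib's implementation in every detail is an assumption; `update_center` is a hypothetical helper, and the step that assigns points to their nearest center is left out.

```python
def update_center(center, count, batch_points, decay_factor):
    """Update one cluster center from the points assigned to it in a new batch.

    center        current center (list of floats)
    count         number of points the center has absorbed so far
    batch_points  points from the new batch assigned to this center
    decay_factor  1.0 weights all history equally; 0.0 forgets it entirely
    """
    m = len(batch_points)
    if m == 0:
        return center, count
    dim = len(center)
    batch_mean = [sum(p[i] for p in batch_points) / m for i in range(dim)]
    old_weight = count * decay_factor  # discounted weight of the history
    new_center = [
        (center[i] * old_weight + batch_mean[i] * m) / (old_weight + m)
        for i in range(dim)
    ]
    return new_center, count + m


# With decay 1.0 the update is a running mean over everything seen so far.
running_mean, n = update_center([0.0, 0.0], 2, [[3.0, 3.0]], decay_factor=1.0)

# With decay 0.0 the center jumps to the mean of the newest batch.
latest_only, _ = update_center([0.0, 0.0], 2, [[3.0, 3.0]], decay_factor=0.0)
```

Values between 0.0 and 1.0 trade stability against responsiveness to drift in the incoming data.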
  • 20. Kafka `Direct` Stream API. The earlier receiver-based approach for Kafka (a receiver built on the high-level consumer) requires replicated journals (write ahead logs) to ensure zero data loss under driver failures. http://spark.apache.org/docs/latest/streaming-kafka-integration.html
  • 21. Kafka `Direct` Stream API. Earlier: receiver-based approach for Kafka, using the high-level consumer. New in Spark 1.3: direct approach for Kafka, using the simple consumer API to read Kafka topics. http://spark.apache.org/docs/latest/streaming-kafka-integration.html
  • 22. Kafka `Direct` Stream API. New direct approach for Kafka in Spark 1.3: treat Kafka like a file system. No receivers! Directly query Kafka for the latest topic offsets, and read data like reading files. Instead of ZooKeeper, Spark Streaming keeps track of Kafka offsets. More efficient, fault-tolerant, exactly-once receiving of Kafka data. http://spark.apache.org/docs/latest/streaming-kafka-integration.html
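The "treat Kafka like a file system" idea from slides 20 to 22 comes down to planning each batch as a set of offset ranges and tracking those offsets in the streaming job itself. The following plain-Python sketch illustrates that bookkeeping only. It does not use any Kafka client; the partitions are modelled as in-memory lists, and `plan_batch` and `read_range` are hypothetical names.

```python
def plan_batch(last_offsets, latest_offsets):
    """Return {partition: (from_offset, until_offset)} for the next batch.

    Partitions with no new data are left out of the plan.
    """
    plan = {}
    for partition, until in latest_offsets.items():
        start = last_offsets.get(partition, 0)
        if until > start:
            plan[partition] = (start, until)
    return plan


def read_range(log, start, until):
    """Read a fixed offset range, much like reading a byte range of a file."""
    return log[start:until]


# Stand-in for two Kafka partitions of one topic.
partitions = {0: ["a", "b", "c", "d"], 1: ["x", "y"]}

# Offsets the streaming job has already processed (tracked by the job itself).
tracked = {0: 1, 1: 2}

# Ask each partition for its latest offset, then plan and read the batch.
latest = {p: len(log) for p, log in partitions.items()}
ranges = plan_batch(tracked, latest)
batch = {p: read_range(partitions[p], s, u) for p, (s, u) in ranges.items()}

# Advance the tracked offsets only after the batch has been planned.
tracked.update({p: u for p, (s, u) in ranges.items()})
```

Because a batch is defined by fixed offset ranges, re-reading the same ranges after a failure returns the same records, which is what makes replay deterministic.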
  • 23. Other library additions: Amazon Kinesis integration [Spark 1.1]; more fault-tolerant Flume integration [Spark 1.1].
  • 24. System infrastructure: automated driver fault-tolerance [Spark 1.0]; graceful shutdown [Spark 1.0]; Write Ahead Logs for zero data loss [Spark 1.2].
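The write ahead log mentioned on slide 24 follows a simple rule: make a received record durable before acknowledging it, so that it can be replayed after a crash. Below is a minimal plain-Python sketch of that rule using a local file. It is not Spark's implementation; the `WriteAheadLog` class, the JSON-lines format, and the file name are assumptions, and a real deployment would write to fault-tolerant storage rather than a temporary directory.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Append-only log of received records, one JSON document per line."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before returning

    def replay(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def receive(wal, buffer, record):
    """Log first, then buffer in memory, then acknowledge the source."""
    wal.append(record)
    buffer.append(record)
    return "ack"


log_dir = tempfile.mkdtemp()
wal = WriteAheadLog(os.path.join(log_dir, "received.log"))

buffer = []
for record in [{"id": 1}, {"id": 2}]:
    receive(wal, buffer, record)

buffer = []               # simulate a crash that wipes in-memory state
recovered = wal.replay()  # the logged records are still available
```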
  • 25. Contributors to Streaming (bar chart: contributors per release for Spark 0.9, 1.0, 1.1, and 1.2; y-axis from 0 to 40).
  • 26. Contributors, full picture (bar chart per release for Spark 0.9, 1.0, 1.1, and 1.2; y-axis from 0 to 120; series: Streaming, and Core + Streaming without SQL, MLlib, …). All contributions to core Spark directly improve Spark Streaming.
  • 27. Spark Packages: more contributions from the community in spark-packages, including an alternate Kafka receiver, an Apache Camel receiver, and Cassandra examples. http://spark-packages.org/
  • 28. Who is using Spark Streaming?
  • 29. Spark Summit 2014 Survey: 40% of Spark users were using Spark Streaming in production or prototyping; another 39% were evaluating it. Breakdown: production 9%, prototyping 31%, evaluating 39%, not using 21%.
  • 32. Intel China builds big data solutions for large enterprises. Multiple streaming applications for top businesses: real-time risk analysis for a top online payment company; real-time deal and flow metric reporting for a top online shopping company.
  • 33. Complicated stream processing: SQL queries on streams; joining streams with large historical datasets. More than 1 TB/day passes through Spark Streaming. Stack: Kafka, RocketMQ, Spark Streaming on YARN, HBase.
  • 34. One of the largest publishing and education companies wants to accelerate its push into digital learning. It needed to combine student activities and domain events to continuously update the learning model of each student. The earlier implementation was in Storm, but it has now moved to Spark Streaming.
  • 35. Chose Spark Streaming because Spark combines batch, streaming, machine learning, and graph processing. Stack: Kafka, Spark Streaming on Spark Standalone, Cassandra, Apache Blur. More information: http://dbricks.co/1BnFZZ8
  • 36. Leading advertising automation company with an exchange platform for in-feed ads. Processes clickstream data to optimize real-time bidding for ads. Stack: Mesos + Marathon, Spark Streaming, Kinesis, MySQL, Redis, RabbitMQ, SQS.
  • 37. Wants to learn trending movies and shows in real time. Currently in the middle of replacing one of their internal stream processing architectures with Spark Streaming. Tested the resiliency of Spark Streaming with Chaos Monkey. More information: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
  • 38. Spark Streaming can handle all kinds of failures. Driver failures are handled with the Spark Standalone cluster’s supervise mode. Worker, executor, and receiver failures are handled automatically. More information: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
  • 39. Neuroscience @ Freeman Lab, Janelia Farm: Spark Streaming and MLlib to analyze neural activities. Laser microscope scans zebrafish brain → Spark Streaming → interactive visualization → laser ZAP to kill neurons! http://www.jeremyfreeman.net/share/talks/spark-summit-2014/
  • 40. Neuroscience @ Freeman Lab, Janelia Farm: streaming machine learning algorithms on time series data of every neuron. Up to 2 TB/hour and increasing with brain size. Up to 80 HPC nodes. http://www.jeremyfreeman.net/share/talks/spark-summit-2014/
  • 41. Why are they adopting Spark Streaming? Easy, high-level API; unified API across batch and streaming; integration with Spark SQL and MLlib; ease of operations.
  • 44. Roadmap Libraries Streaming machine learning algorithms A/B testing Online Latent Dirichlet Allocation (LDA) More streaming linear algorithms Streaming + DataFrames, Streaming + SQL 44
  • 45. Roadmap: Operational Ease. Better flow control. Elastic scaling. Cross-version upgradability. Improved support for non-Hadoop environments.
  • 46. Roadmap: Performance. Higher throughput, especially for stateful operations. Lower latencies. Easy deployment of streaming apps in Databricks Cloud!
  • 47. You can help! Roadmaps are heavily driven by community feedback. We have listened to community demands over the last year: Write Ahead Logs for zero data loss, and the new Kafka direct API. Let us know what you want to see in Spark Streaming: post to the Spark user mailing list, or tweet it to me @tathadas.
  • 48. Industry adoption increasing rapidly. Community contributing very actively. More libraries, operational ease and performance in the roadmap. @tathadas
  • 50. Typesafe survey of Spark users: 2136 developers, data scientists, and other tech professionals. http://java.dzone.com/articles/apache-spark-survey-typesafe-0
  • 51. Typesafe survey of Spark users: 65% of Spark users are interested in Spark Streaming.
  • 52. Typesafe survey of Spark users: 2/3 of Spark users want to process event streams.
  • 54. •  Big data solution provider for enterprises •  Multiple applications for different businesses -  Monitoring + optimizing online services of Tier-1 bank -  Fraudulent transaction detection for Tier-2 bank •  Kafka → Spark Streaming → Cassandra, MongoDB •  Built their own Stratio Streaming platform on Spark Streaming, Kafka, Cassandra, MongoDB
  • 55. •  Provides data analytics solutions for Communication Service Providers -  4 of 5 top mobile operators, 3 of 4 top internet backbone providers -  Processes >50% of all US mobile traffic •  Multiple applications for different businesses -  Real-time anomaly detection in cell tower traffic -  Real-time call quality optimizations •  Kafka → Spark Streaming •  More information: http://spark-summit.org/2014/talk/building-big-data-operational-intelligence-platform-with-apache-spark
  • 56. •  Runs claims processing applications for healthcare providers •  Predictive models can look for claims that are likely to be held up for approval •  Spark Streaming allows model scoring in seconds instead of hours •  More information: http://searchbusinessanalytics.techtarget.com/feature/Spark-Streaming-project-looks-to-shed-new-light-on-medical-claims