SlideShare une entreprise Scribd logo
1  sur  27
Spark and Storm at Yahoo 
Wh y c h o o s e o n e o v e r t h e o t h e r ? 
P R E S E N T E D B Y B o b b y E v a n s a n d T o m G r a v e s
Tom Graves 
Bobby Evans (bobby@apache.org) 
2 
 Committers and PMC/PPMC Members for 
› Apache Storm incubating (Bobby) 
› Apache Hadoop (Tom and Bobby) 
› Apache Spark (Tom and Bobby) 
› Apache TEZ (Tom and Bobby) 
 Low Latency Big Data team at Yahoo (Part of the Hadoop Team) 
› Apache Storm as a service 
• 1,300+ nodes total, 250 node cluster (soon to be 4000 nodes). 
› Apache Spark on YARN 
• 40,000 nodes total, 5000+ node cluster 
› Help with distributed ML and deep learning.
Where we come from 
Yahoo Champaign: 
• 100+ engineers 
• Located in UIUC Research Park http://researchpark.illinois.edu/ 
• Split between Advertising and Data Platform team and Hadoop team. 
• Hadoop team provides the Hadoop ecosystem as a service to all of Yahoo. 
• Site is 7 years old, and we are building a new building with room for 200. 
• We are Hiring 
• resume-hadoop@yahoo-inc.com 
• http://bit.ly/1ybTXMe
Agenda 
Spark Overview (1.1) 
Storm Overview (0.9.2) 
Things to Consider 
Example Architectures 
4 Yahoo Confidential & Proprietary
Apache Spark 
5
Spark Key Concepts 
Write programs in terms of 
transformations on distributed 
Resilient Distributed 
Datasets 
 Collections of objects spread 
across a cluster, stored in RAM 
or on Disk 
 Built through parallel 
transformations 
 Automatically rebuilt on failure 
Operations 
 Transformations 
(e.g. map, filter, 
groupBy) 
 Actions 
(e.g. count, collect, 
save) 
datasets
Working With RDDs 
RDD 
RDD 
RDD 
RDD 
Transformations 
textFile = sc.textFile(”SomeFile.txt”) 
Action Value 
linesWithSpark = textFile.filter(lambda line: "Spark” in line) 
linesWithSpark.count() 
74 
linesWithSpark.first() 
# Apache Spark
Example: Word Count 
> lines = sc.textFile(“hamlet.txt”) 
> counts = lines.flatMap(lambda line: line.split(“ ”)) 
.map(lambda word => (word, 1)) 
.reduceByKey(lambda x, y: x + y) 
“to be or” 
“not to be” 
“to” 
“be” 
“or” 
“not” 
“to” 
“be” 
(to, 1) 
(be, 1) 
(or, 1) 
(not, 1) 
(to, 1) 
(be, 1) 
(be, 2) 
(not, 1) 
(or, 1) 
(to, 2)
Spark Streaming Word Count 
updateFunc = (values: Seq[Int], state: Option[Int]) => { 
val currentCount = values.foldLeft(0)(_ + _) 
val previousCount = state.getOrElse(0) 
Some(currentCount + previousCount) 
} 
… 
lines = ssc.socketTextStream(args(0), args(1).toInt) 
Words = lines.flatMap(lambda line: line.split(“ ”)) 
wordDstream = words.map(lambda word => (word, 1)) 
stateDstream = wordDstream.updateStateByKey[Int](updateFunc) 
ssc.start() 
ssc.awaitTermination()
10 
Apache Storm
Storm Concepts 
1. Streams 
› Unbounded sequence of tuples 
2. Spout 
› Source of Stream 
› E.g. Read from Twitter streaming API 
3. Bolts 
› Processes input streams and produces 
new streams 
› E.g. Functions, Filters, Aggregation, 
Joins 
4. Topologies 
› Network of spouts and bolts
Storm Architecture 
Master 
Node 
Cluster 
Coordination 
Worker 
Worker 
Worker 
Worker 
Processes 
Nimbus 
Zookeeper 
Zookeeper 
Zookeeper 
Supervisor 
Supervisor 
Supervisor 
Supervisor Worker 
Launches 
Workers
Trident (Storm) Word Count 
TridentTopology topology = new TridentTopology(); 
TridentState wordCounts = topology.newStream("spout1", spout) 
.each(new Fields("sentence"), new Split(), new Fields("word")) 
.groupBy(new Fields("word")) 
.persistentAggregate(new MemoryMapState.Factory(), new Count(), 
new Fields("count")).parallelismHint(6); 
“to be or” 
“to” 
“be” 
“or” 
(to, 1) 
(be, 1) 
(or, 1) 
1) 
1) 
“not to be” 
“not” 
“to” 
“be” 
(not, 1) 
(to, 1) 
(be, 1) 
(be, 2) 
(not, 1) 
(or, 1) 
(to, 2)
Use the Right Tool for the Job 
14 
https://www.flickr.com/photos/hikingartist/4193330368/
Things to Consider 
15 
Scale 
Latency 
 Iterative Processing 
› Are there suitable non-iterative alternatives? 
Use What You Know 
Code Reuse 
Maturity
When We Recommend Spark 
16 
 Iterative Batch Processing (most Machine Learning) 
› There really is nothing else right now. 
› Has some scale issues. 
 Tried ETL (Not at Yahoo scale yet) 
 Tried Shark/Interactive Queries (Not at Yahoo scale yet) 
 < 1 TB (or memory size of your cluster) 
 Tuning it to run well can be a pain 
 Data Bricks and others are working on scaling. 
 Streaming is all μ-batch so latency is at least 1 sec 
 Streaming has single points of failure still 
 All streaming inputs are replicated in memory
When We Recommend Storm 
17 
 Latency < 1 second (single event at a time) 
› There is little else (especially not open source) 
 “Real Time” … 
› Analytics 
› Budgeting 
› ML 
› Anything 
 Lower Level API than Spark 
 No built-in concept of look back aggregations 
 Takes more effort to combine batch with streaming
Fictitious Example: My Commute App 
18 
 Mobile App that lets users track their commute. 
 Cities, users, companies, etc. compete daily for 
› Shortest commute time 
› Greenest commute 
 Make money by selling location based ads and aggregate data to 
› Governments 
› Advertisers 
 Feel free to steal my crazy idea, I just want to be invited to the launch 
party, and I wouldn't say no to some stock.
Chicago vs. Champaign Urbana 
19 
Champaign Urbana: 14-15 min 
Chicago: 20-30 min 
35 
30 
25 
20 
15 
10 
5 
0 
Bobby 
CU Chicago 
Source: http://project.wnyc.org/commute-times-us/embed.html#5.00/42.000/-89.500
Things to Consider 
20 
Scale 
› everyone in the world!!! 
Latency 
› a few seconds max 
 Iterative Processing 
› Possibly for targeting, but there are alternatives
Architecture 
App Web 
Service 
(User, Commute 
ID, Location 
History, MPG) 
Kafka Storm 
HBase/NoSQ 
L 
HDFS Spark 
Customer 
21
Architecture (Alternative) 
App Web 
Service 
(User, Commute 
ID, Location 
History, MPG) 
HBase/NOS 
QL 
HDFS Spark 
Customer 
22 
Go directly to Spark Streaming, 
but data loss potential goes up.
Architecture (Alternative 2) 
App Web 
Service 
(User, Commute 
ID, Location 
History, MPG) 
Kafka Storm 
HBase/NOS 
QL 
Customer 
23 
Streaming Operations Only 
(Kappa Architecture)
Fictitious Example 2: Web Scale Monitoring 
24 
 Look for trends that can indicate a problem. 
› Alert or provide automated corrections 
 Provide an interface to visualize 
› Current data very quickly 
› Historical data in depth 
 If you commercialize this one please give me/Yahoo a free license for 
life (open source works too)
Things to Consider 
25 
Scale 
› Lots of events from many different servers 
Latency 
› a few seconds max, but the fewer the better 
 Iterative Processing 
› For in depth analysis definetly
Fictitious Example 2: Web Scale Monitoring 
26 
Servers 
HBase 
Kafka Storm 
HDFS Spark 
UI 
Alert!! 
JDBC 
Server 
Rules 
ML and trend 
analysis
Questions? 
bobby@apache.org resume-hadoop@yahoo-inc.com 
http://bit.ly/1ybTXMe

Contenu connexe

Tendances

Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVROairisData
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideIBM
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeDatabricks
 
Comparing Apache Cassandra 4.0, 3.0, and ScyllaDB
Comparing Apache Cassandra 4.0, 3.0, and ScyllaDBComparing Apache Cassandra 4.0, 3.0, and ScyllaDB
Comparing Apache Cassandra 4.0, 3.0, and ScyllaDBScyllaDB
 
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfAltinity Ltd
 
Tutorial - Modern Real Time Streaming Architectures
Tutorial - Modern Real Time Streaming ArchitecturesTutorial - Modern Real Time Streaming Architectures
Tutorial - Modern Real Time Streaming ArchitecturesKarthik Ramasamy
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkDatabricks
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is FailingDataWorks Summit
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryDatabricks
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on ReadDatabricks
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
Adapting and adopting spm v04
Adapting and adopting spm v04Adapting and adopting spm v04
Adapting and adopting spm v04Carlos Sierra
 

Tendances (20)

Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
 
Comparing Apache Cassandra 4.0, 3.0, and ScyllaDB
Comparing Apache Cassandra 4.0, 3.0, and ScyllaDBComparing Apache Cassandra 4.0, 3.0, and ScyllaDB
Comparing Apache Cassandra 4.0, 3.0, and ScyllaDB
 
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
 
Tutorial - Modern Real Time Streaming Architectures
Tutorial - Modern Real Time Streaming ArchitecturesTutorial - Modern Real Time Streaming Architectures
Tutorial - Modern Real Time Streaming Architectures
 
Hadoop
HadoopHadoop
Hadoop
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark
 
Clickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache SparkClickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache Spark
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on Read
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Adapting and adopting spm v04
Adapting and adopting spm v04Adapting and adopting spm v04
Adapting and adopting spm v04
 

En vedette

Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopDataWorks Summit
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014P. Taylor Goetz
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationnathanmarz
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureP. Taylor Goetz
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignMichael Noll
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark StreamingP. Taylor Goetz
 
Safety hand tools & grinding
Safety hand tools & grindingSafety hand tools & grinding
Safety hand tools & grindingLavanya Singh
 
Kafka Tutorial Advanced Kafka Consumers
Kafka Tutorial Advanced Kafka ConsumersKafka Tutorial Advanced Kafka Consumers
Kafka Tutorial Advanced Kafka ConsumersJean-Paul Azar
 

En vedette (9)

Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
 
Safety hand tools & grinding
Safety hand tools & grindingSafety hand tools & grinding
Safety hand tools & grinding
 
Kafka Tutorial Advanced Kafka Consumers
Kafka Tutorial Advanced Kafka ConsumersKafka Tutorial Advanced Kafka Consumers
Kafka Tutorial Advanced Kafka Consumers
 

Similaire à Yahoo compares Storm and Spark

Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Folio3 Software
 
Open Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCOpen Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCSheetal Dolas
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Brian O'Neill
 
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...DataStax Academy
 
Making Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxMaking Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxTrayan Iliev
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Databricks
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverDataWorks Summit
 
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizardPhily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizardBrian O'Neill
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013alanfgates
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Apache Beam (incubating)
Apache Beam (incubating)Apache Beam (incubating)
Apache Beam (incubating)Apache Apex
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesOleksii Diagiliev
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify StoryNeville Li
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015Christopher Curtin
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017Jags Ramnarayan
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData
 

Similaire à Yahoo compares Storm and Spark (20)

Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
 
Open Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCOpen Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOC
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
 
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R...
 
Making Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxMaking Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFlux
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game Forever
 
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizardPhily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Apache Beam (incubating)
Apache Beam (incubating)Apache Beam (incubating)
Apache Beam (incubating)
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpaces
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify Story
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
 
Tech
TechTech
Tech
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 

Plus de Chicago Hadoop Users Group

Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Chicago Hadoop Users Group
 
Choosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChoosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChicago Hadoop Users Group
 
Everything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieEverything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieChicago Hadoop Users Group
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopChicago Hadoop Users Group
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917Chicago Hadoop Users Group
 
Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Chicago Hadoop Users Group
 

Plus de Chicago Hadoop Users Group (19)

Kinetica master chug_9.12
Kinetica master chug_9.12Kinetica master chug_9.12
Kinetica master chug_9.12
 
Chug dl presentation
Chug dl presentationChug dl presentation
Chug dl presentation
 
Using Apache Drill
Using Apache DrillUsing Apache Drill
Using Apache Drill
 
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
 
Meet Spark
Meet SparkMeet Spark
Meet Spark
 
Choosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChoosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your Business
 
An Overview of Ambari
An Overview of AmbariAn Overview of Ambari
An Overview of Ambari
 
Hadoop and Big Data Security
Hadoop and Big Data SecurityHadoop and Big Data Security
Hadoop and Big Data Security
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Advanced Oozie
Advanced OozieAdvanced Oozie
Advanced Oozie
 
Scalding for Hadoop
Scalding for HadoopScalding for Hadoop
Scalding for Hadoop
 
Financial Data Analytics with Hadoop
Financial Data Analytics with HadoopFinancial Data Analytics with Hadoop
Financial Data Analytics with Hadoop
 
Everything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieEverything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about Oozie
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache Hadoop
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917
 
Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604
 
Hadoop in a Windows Shop - CHUG - 20120416
Hadoop in a Windows Shop - CHUG - 20120416Hadoop in a Windows Shop - CHUG - 20120416
Hadoop in a Windows Shop - CHUG - 20120416
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416
 

Dernier

Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 

Dernier (20)

Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

Yahoo compares Storm and Spark

  • 1. Spark and Storm at Yahoo Wh y c h o o s e o n e o v e r t h e o t h e r ? P R E S E N T E D B Y B o b b y E v a n s a n d T o m G r a v e s
  • 2. Tom Graves Bobby Evans (bobby@apache.org) 2  Committers and PMC/PPMC Members for › Apache Storm incubating (Bobby) › Apache Hadoop (Tom and Bobby) › Apache Spark (Tom and Bobby) › Apache TEZ (Tom and Bobby)  Low Latency Big Data team at Yahoo (Part of the Hadoop Team) › Apache Storm as a service • 1,300+ nodes total, 250 node cluster (soon to be 4000 nodes). › Apache Spark on YARN • 40,000 nodes total, 5000+ node cluster › Help with distributed ML and deep learning.
  • 3. Where we come from Yahoo Champaign: • 100+ engineers • Located in UIUC Research Park http://researchpark.illinois.edu/ • Split between Advertising and Data Platform team and Hadoop team. • Hadoop team provides the Hadoop ecosystem as a service to all of Yahoo. • Site is 7 years old, and we are building a new building with room for 200. • We are Hiring • resume-hadoop@yahoo-inc.com • http://bit.ly/1ybTXMe
  • 4. Agenda Spark Overview (1.1) Storm Overview (0.9.2) Things to Consider Example Architectures 4 Yahoo Confidential & Proprietary
  • 6. Spark Key Concepts Write programs in terms of transformations on distributed Resilient Distributed Datasets  Collections of objects spread across a cluster, stored in RAM or on Disk  Built through parallel transformations  Automatically rebuilt on failure Operations  Transformations (e.g. map, filter, groupBy)  Actions (e.g. count, collect, save) datasets
  • 7. Working With RDDs RDD RDD RDD RDD Transformations textFile = sc.textFile(”SomeFile.txt”) Action Value linesWithSpark = textFile.filter(lambda line: "Spark” in line) linesWithSpark.count() 74 linesWithSpark.first() # Apache Spark
  • 8. Example: Word Count > lines = sc.textFile(“hamlet.txt”) > counts = lines.flatMap(lambda line: line.split(“ ”)) .map(lambda word => (word, 1)) .reduceByKey(lambda x, y: x + y) “to be or” “not to be” “to” “be” “or” “not” “to” “be” (to, 1) (be, 1) (or, 1) (not, 1) (to, 1) (be, 1) (be, 2) (not, 1) (or, 1) (to, 2)
  • 9. Spark Streaming Word Count updateFunc = (values: Seq[Int], state: Option[Int]) => { val currentCount = values.foldLeft(0)(_ + _) val previousCount = state.getOrElse(0) Some(currentCount + previousCount) } … lines = ssc.socketTextStream(args(0), args(1).toInt) Words = lines.flatMap(lambda line: line.split(“ ”)) wordDstream = words.map(lambda word => (word, 1)) stateDstream = wordDstream.updateStateByKey[Int](updateFunc) ssc.start() ssc.awaitTermination()
  • 11. Storm Concepts 1. Streams › Unbounded sequence of tuples 2. Spout › Source of Stream › E.g. Read from Twitter streaming API 3. Bolts › Processes input streams and produces new streams › E.g. Functions, Filters, Aggregation, Joins 4. Topologies › Network of spouts and bolts
  • 12. Storm Architecture Master Node Cluster Coordination Worker Worker Worker Worker Processes Nimbus Zookeeper Zookeeper Zookeeper Supervisor Supervisor Supervisor Supervisor Worker Launches Workers
  • 13. Trident (Storm) Word Count TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")).parallelismHint(6); “to be or” “to” “be” “or” (to, 1) (be, 1) (or, 1) 1) 1) “not to be” “not” “to” “be” (not, 1) (to, 1) (be, 1) (be, 2) (not, 1) (or, 1) (to, 2)
  • 14. Use the Right Tool for the Job 14 https://www.flickr.com/photos/hikingartist/4193330368/
  • 15. Things to Consider 15 Scale Latency  Iterative Processing › Are there suitable non-iterative alternatives? Use What You Know Code Reuse Maturity
  • 16. When We Recommend Spark 16  Iterative Batch Processing (most Machine Learning) › There really is nothing else right now. › Has some scale issues.  Tried ETL (Not at Yahoo scale yet)  Tried Shark/Interactive Queries (Not at Yahoo scale yet)  < 1 TB (or memory size of your cluster)  Tuning it to run well can be a pain  Data Bricks and others are working on scaling.  Streaming is all μ-batch so latency is at least 1 sec  Streaming has single points of failure still  All streaming inputs are replicated in memory
  • 17. When We Recommend Storm 17  Latency < 1 second (single event at a time) › There is little else (especially not open source)  “Real Time” … › Analytics › Budgeting › ML › Anything  Lower Level API than Spark  No built-in concept of look back aggregations  Takes more effort to combine batch with streaming
  • 18. Fictitious Example: My Commute App 18  Mobile App that lets users track their commute.  Cities, users, companies, etc. compete daily for › Shortest commute time › Greenest commute  Make money by selling location based ads and aggregate data to › Governments › Advertisers  Feel free to steal my crazy idea, I just want to be invited to the launch party, and I wouldn't say no to some stock.
  • 19. Chicago vs. Champaign Urbana 19 Champaign Urbana: 14-15 min Chicago: 20-30 min 35 30 25 20 15 10 5 0 Bobby CU Chicago Source: http://project.wnyc.org/commute-times-us/embed.html#5.00/42.000/-89.500
  • 20. Things to Consider 20 Scale › everyone in the world!!! Latency › a few seconds max  Iterative Processing › Possibly for targeting, but there are alternatives
  • 21. Architecture App Web Service (User, Commute ID, Location History, MPG) Kafka Storm HBase/NoSQ L HDFS Spark Customer 21
  • 22. Architecture (Alternative) App Web Service (User, Commute ID, Location History, MPG) HBase/NOS QL HDFS Spark Customer 22 Go directly to Spark Streaming, but data loss potential goes up.
  • 23. Architecture (Alternative 2) App Web Service (User, Commute ID, Location History, MPG) Kafka Storm HBase/NOS QL Customer 23 Streaming Operations Only (Kappa Architecture)
  • 24. Fictitious Example 2: Web Scale Monitoring 24  Look for trends that can indicate a problem. › Alert or provide automated corrections  Provide an interface to visualize › Current data very quickly › Historical data in depth  If you commercialize this one please give me/Yahoo a free license for life (open source works too)
  • 25. Things to Consider 25 Scale › Lots of events from many different servers Latency › a few seconds max, but the fewer the better  Iterative Processing › For in depth analysis definetly
  • 26. Fictitious Example 2: Web Scale Monitoring 26 Servers HBase Kafka Storm HDFS Spark UI Alert!! JDBC Server Rules ML and trend analysis

Notes de l'éditeur

  1. RDD  Colloquially referred to as RDDs (e.g. caching in RAM) Lazy operations to build RDDs from other RDDs Return a result or write it to storage
  2. Let me illustrate this with some bad powerpoint diagrams and animations This diagram is LOGICAL,
  3. Trend analysis is difficult but sketches for approximations on many aggregates and Gradient Decent or VW for ML make this still an attractive option.