What's New in the Berkeley Data Analytics Stack
Tathagata Das, Reynold Xin (AMPLab, UC Berkeley)
Hadoop Summit 2013
Berkeley Data Analytics Stack
[Stack diagram: Shark (SQL), Spark Streaming, GraphX, and MLBase built on the Spark engine, over HDFS / Hadoop storage and the Mesos / YARN resource manager]
Today’s Talk
[Stack diagram as above]
Project History
2010: Spark (core execution engine) open sourced
2012: Shark open sourced
Feb 2013: Spark Streaming alpha open sourced
Jun 2013: Spark entered Apache Incubator
Community
3000+ people in online training
800+ meetup members
60+ developers contributing
17 companies contributing
Hadoop and continuous computing: looking beyond MapReduce
Bruno Fernandez-Ruiz, Senior Fellow & VP
[Keynote slide screenshots: 2012 Hadoop Summit (Future of Apache Hadoop) and 2013 Hadoop Summit (Hadoop Economics)]
Today’s Talk
[Stack diagram as above]
Spark
Fast and expressive cluster computing system interoperable with Apache Hadoop
Improves efficiency through:
»In-memory computing primitives
»General computation graphs
Improves usability through:
»Rich APIs in Scala, Java, Python
»Interactive shell
Up to 100× faster (2-10× on disk)
Often 5× less code
Why a New Framework?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
»More complex, multi-pass analytics (e.g. ML, graph)
»More interactive ad-hoc queries
»More real-time stream processing
Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
»Distributed collections of objects
»Can optionally be cached in memory across the cluster
»Manipulated through parallel operators
»Automatically recomputed on failure
Programming interface
»Functional APIs in Scala, Java, Python
»Interactive use from the Scala and Python shells
Example: Log Mining
Exposes RDDs through a functional API in Java, Python, Scala

lines = spark.textFile("hdfs://...")              // base RDD
errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
errors.persist()
errors.filter(_.contains("foo")).count()          // action
errors.filter(_.contains("bar")).count()

[Diagram: the master sends tasks to workers, each caching a partition of the errors RDD (Errors 1-3) built from HDFS blocks 1-3; results flow back to the master]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: 1 TB data in 5 sec (vs 170 sec for on-disk data)
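For readers who want to run this, here is a minimal Scala rendering of the slide's pseudocode; it assumes an existing SparkContext named sc (as in the Spark shell), and the HDFS path is purely illustrative.

// Runnable Scala sketch of the log-mining example above.
// Assumes a SparkContext `sc`; the HDFS path is illustrative.
val lines  = sc.textFile("hdfs://namenode:8020/logs/*")
val errors = lines.filter(_.startsWith("ERROR"))
errors.persist()                                  // keep the filtered data in memory

val fooCount = errors.filter(_.contains("foo")).count()
val barCount = errors.filter(_.contains("bar")).count()
println("foo errors: " + fooCount + ", bar errors: " + barCount)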
Spark: Expressive API
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
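To give a feel for how these operators compose, here is a small self-contained Scala sketch, assuming an existing SparkContext named sc; the data is made up for illustration.

// map, filter, groupByKey, reduceByKey, and join chained on a tiny dataset.
val events = sc.parallelize(Seq("us,click", "de,view", "us,view", "us,click"))

val pagesByCountry = events
  .map(_.split(","))                   // map
  .filter(_.length == 2)               // filter
  .map(a => (a(0), a(1)))
  .groupByKey()                        // groupByKey

val clickCounts = events
  .filter(_.endsWith("click"))
  .map(line => (line.split(",")(0), 1))
  .reduceByKey(_ + _)                  // reduceByKey

pagesByCountry.join(clickCounts)       // join
  .collect()
  .foreach(println)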
Machine Learning Algorithms
[Bar charts, time per iteration (s): Logistic Regression: Spark 0.96 s vs. Hadoop MR 110 s; K-Means Clustering: Spark 4.1 s vs. Hadoop MR 155 s]
Spark in Java and Python

Python API
lines = spark.textFile(…)
errors = lines.filter(lambda s: "ERROR" in s)
errors.count()

Java API
JavaRDD<String> lines = spark.textFile(…);
JavaRDD<String> errors = lines.filter(
  new Function<String, Boolean>() {
    public Boolean call(String s) {
      return s.contains("ERROR");
    }
  });
errors.count();
Projects Building on Spark
[Stack diagram as above]
GraphX
Combining data-parallel and graph-parallel computation
»Run graph analytics and ETL in the same engine
»Consume graph computation output in Spark
»Interactive shell
Programmability
»Support GraphLab / Pregel APIs in 20 LOC
»Implement PageRank in 5 LOC
Coming this summer as a Spark module
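The deck does not show the PageRank code, and GraphX itself had not shipped at the time; purely as an illustration of this style of graph computation, here is a plain-RDD PageRank sketch on Spark (not the GraphX API), assuming an existing SparkContext named sc and a toy edge list.

// Plain Spark RDD PageRank sketch (illustrative only; not the GraphX API).
val edges = sc.parallelize(Seq(("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")))

val links = edges.groupByKey().cache()        // page -> outgoing neighbors
var ranks = links.mapValues(_ => 1.0)         // start every page at rank 1.0

for (_ <- 1 to 10) {
  val contribs = links.join(ranks).values.flatMap {
    case (neighbors, rank) => neighbors.map(dst => (dst, rank / neighbors.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(r => 0.15 + 0.85 * r)
}

ranks.collect().foreach(println)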
Scalable Machine Learning
What you want to do: build a classifier for X
What you have to do:
• Learn the internals of ML classification algorithms, sampling, feature selection, cross-validation, …
• Potentially learn Spark/Hadoop/…
• Implement 3-4 algorithms
• Implement grid search to find the right algorithm parameters
• Implement validation algorithms
• Experiment with different sampling sizes, algorithms, features
• …
… and in the end, ask for help
MLBase
Making large scale machine learning easy
»User specifies the task (e.g. “classify this dataset”)
»MLBase picks the best algorithm and best parameters for the task
Develop scalable, high-quality ML algorithms
»Naïve Bayes
»Logistic/Least Squares Regression (L1/L2 Regularization)
»Matrix Factorization (ALS, CCD)
»K-Means & DP-Means
First release (summer): collection of scalable algorithms
Today’s Talk
[Stack diagram as above]
Shark
Hive compatible: HiveQL, UDFs, metadata, etc.
»Works in existing Hive warehouses without changing queries or data!
Fast execution engine
»Uses Spark as the underlying execution engine
»Low-latency, interactive queries
»Scales out and tolerates worker failures
Easy to combine with Spark
»Process data with SQL queries as well as raw Spark programs
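As a sketch of what combining the two might look like: Shark's Scala API documented a sql2rdd-style call that returned query results as an RDD, which plain Spark code can then transform. The table, column, and row-accessor names below are hypothetical placeholders.

// Hedged sketch: run HiveQL through Shark, then continue in plain Spark.
// Assumes a SharkContext `sharkCtx`; the getString row accessor is hypothetical.
val views = sharkCtx.sql2rdd("SELECT country, page FROM page_views WHERE status = 200")

views
  .map(row => (row.getString("country"), 1L))   // hypothetical accessor
  .reduceByKey(_ + _)
  .collect()
  .sortBy(-_._2)
  .take(10)
  .foreach(println)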
Real-world Performance
[Bar chart: runtime (seconds) for queries Q1-Q4 over 1.7 TB of real warehouse data on 100 EC2 nodes, comparing Shark, Shark (disk), and Hive; the Shark bars are labeled 1.1, 0.8, 0.7, and 1.0 seconds]
Comparison
[Bar chart: runtime (seconds, 0-20) comparing Impala, Impala (mem), Redshift, Shark (disk), and Shark (mem)]
http://tinyurl.com/bigdata-benchmark
Today’s Talk
[Stack diagram as above]
Spark Streaming
Extends Spark for large scale stream processing
»Receive data directly from Kafka, Flume, Twitter, etc.
»Fast, scalable, and fault-tolerant
Simple, yet rich batch-like API
»Easy to express your complex streaming computation
»Fault-tolerant, stateful stream processing out of the box
Motivation
Many important applications must process large streams of live data and provide results in near-real-time
» Social network trends
» Website statistics
» Intrusion detection systems
» Etc.
Challenges
Require large clusters
Require latencies of a few seconds
Require fault-tolerance
Require integration with batch processing
Integration with Batch Processing
Many environments require processing the same data in live streaming as well as in batch post-processing
Hard for any existing single framework to achieve both:
» Provide low latency for streaming workloads
» Handle large volumes of data for batch workloads
Extremely painful to maintain two stacks
» Different programming models
» Double the implementation effort
» Double the number of bugs
Existing Streaming Systems
Storm – limited fault-tolerance guarantees
»Replays records if not processed
»Processes each record at least once
»May double-count events!
»Mutable state can be lost due to failure!
Trident – uses transactions to update state
»Processes each record exactly once
»Per-state transactions to an external database are slow
Neither integrates well with batch processing systems
Spark Streaming
Discretized Stream Processing: run a streaming computation as a series of very small, deterministic batch jobs
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as RDDs and processes them using RDD operations
• Finally, the processed results of the RDD operations are returned in batches
[Diagram: a live data stream enters Spark Streaming, which emits batches of X seconds; Spark processes each batch and returns the processed results]
Spark Streaming
Discretized Stream Processing: run a streaming computation as a series of very small, deterministic batch jobs
• Batch sizes as low as ½ second, latency ~1 second
• Potential for combining batch processing and streaming processing in the same system (a minimal word-count sketch follows below)
[Diagram as above: live data stream, batches of X seconds, processed results]
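As an illustration of the batching model, a word count over a TCP text stream with a 1-second batch interval could look like the sketch below; the exact StreamingContext constructor and shutdown calls vary across Spark versions, so treat those details as assumptions.

// Minimal Spark Streaming word count with 1-second batches (sketch).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    val ssc  = new StreamingContext(conf, Seconds(1))     // batch interval = 1 second

    val lines  = ssc.socketTextStream("localhost", 9999)  // live data stream
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()                                        // processed results, per batch

    ssc.start()
    ssc.awaitTermination()
  }
}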
Example: Get Twitter Hashtags
val tweets = ssc.twitterStream(<username>, <password>)
DStream: a sequence of RDDs representing a stream of data
[Diagram: the Twitter Streaming API feeds the tweets DStream as batches at t, t+1, t+2, each stored in memory as an immutable, distributed RDD]
Example: Get Twitter Hashtags
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap (status => getTags(status))
transformation: modify data in one DStream to create another DStream
[Diagram: flatMap applied to each batch of the tweets DStream yields the new hashTags DStream (e.g. [#cat, #dog, …]); new RDDs are created for every batch]
Example: Get Twitter Hashtags
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
output operation: to push data to external storage
[Diagram: flatMap then save on each batch; every batch of the hashTags DStream is saved to HDFS]
Example: Get Twitter Hashtags
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.foreach(hashTagRDD => { … })
foreach: do whatever you want with the processed data
[Diagram: flatMap then foreach on each batch; write to a database, update an analytics UI, do whatever you want]
Window-based Transformations
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
[Diagram: a sliding window operation over a DStream, showing the window length and sliding interval]
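For counts specifically, essentially the same windowed result can be expressed with reduceByKeyAndWindow; a brief sketch reusing the hashTags DStream from the code above, with the same 1-minute window sliding every 5 seconds:

// Windowed tag counts via reduceByKeyAndWindow (sketch).
val windowedTagCounts = hashTags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow(_ + _, Minutes(1), Seconds(5))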
Arbitrary Stateful Computations
Specify function to generate new state based on previous state and new data
» Example: Maintain per-user mood as state, and update it with their tweets
updateMood(newTweets, lastMood) => newMood
moods = tweets.updateStateByKey(tweets => updateMood(tweets))
» Exactly-once semantics even under worker failures
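The one-liner above is shorthand: updateStateByKey's update function actually receives the batch's new values and the previous state as an Option. A concrete sketch, assuming a DStream[(String, String)] named tweetsByUser keyed by user id and using a simple running count as the per-user state (a per-user mood would be maintained the same way):

// updateStateByKey with its (Seq[V], Option[S]) => Option[S] update function (sketch).
// Note: stateful operations also require a checkpoint directory, e.g. ssc.checkpoint(dir).
def updateCount(newTweets: Seq[String], previous: Option[Long]): Option[Long] =
  Some(previous.getOrElse(0L) + newTweets.size)

val tweetCounts = tweetsByUser.updateStateByKey(updateCount _)
tweetCounts.print()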
Arbitrary Combination of Batch and Streaming Computations
Inter-mix RDD and DStream operations!
» Example: Join incoming tweets with a spam HDFS file to filter out bad tweets
tweets.transform(tweetsRDD => {
  tweetsRDD.join(spamHDFSFile).filter(...)
})
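To make the join concrete, both sides need to be keyed. A hedged sketch, assuming a text file of known spammer user ids on HDFS (path illustrative), a SparkContext sc, and a tweetsByUser DStream keyed by user id:

// Mixing a static RDD with a DStream via transform() (sketch; names are assumptions).
val spamUsers = sc.textFile("hdfs://namenode:8020/spam-users.txt")
                  .map(userId => (userId, true))

val cleanTweets = tweetsByUser.transform { tweetsRDD =>
  tweetsRDD.leftOuterJoin(spamUsers)
           .filter { case (_, (_, isSpam)) => isSpam.isEmpty }  // keep users not in the spam list
           .map { case (userId, (tweet, _)) => (userId, tweet) }
}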
DStream Input Sources
Out of the box we provide
»Kafka
»Twitter
»HDFS
»Flume
»Raw TCP sockets
Very simple API to write a receiver for your own data source!
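The receiver extension point has been renamed across releases; the sketch below uses the Receiver class from later Spark Streaming versions (the class available at the time of this talk was named differently), purely to show the shape of a custom source.

// Custom receiver sketch (API from later Spark Streaming releases).
import java.net.Socket
import scala.io.Source
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SocketLineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receive on a background thread so onStart() returns quickly.
    new Thread("socket-line-receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {}  // socket is closed when receive() finishes

  private def receive(): Unit = {
    val socket = new Socket(host, port)
    try {
      Source.fromInputStream(socket.getInputStream).getLines().foreach(line => store(line))
    } finally {
      socket.close()
    }
  }
}

// Usage: val lines = ssc.receiverStream(new SocketLineReceiver("localhost", 9999))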
Performance
Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency
» Tested with 100 text streams on 100 EC2 instances with 4 cores each
[Charts: cluster throughput (GB/s) vs. number of nodes for Grep and WordCount, each at 1-second and 2-second batch intervals. High throughput and low latency]
Comparison with Storm
Higher throughput than Storm
»Spark Streaming: 670k records/second/node
»Storm: 115k records/second/node
[Charts: throughput per node (MB/s) vs. record size (100 and 1000 bytes) for Grep and WordCount, comparing Spark and Storm]
Fast Fault Recovery
Recovers from faults/stragglers within 1 sec
Real Applications: Traffic Sensing
Traffic transit time estimation using online machine learning on GPS observations
• Markov chain Monte Carlo simulations on GPS observations
• Very CPU intensive; requires dozens of machines for useful computation
• Scales linearly with cluster size
[Chart: GPS observations processed per second vs. number of nodes in the cluster (up to 80), scaling roughly linearly]
Unifying Batch and Stream Models
Spark program on Twitter log file using RDDs
val tweets = sc.hadoopFile("hdfs://...")
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFile("hdfs://...")
Spark Streaming program on Twitter stream using DStreams
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
Same code base works for both batch processing and stream processing
Conclusion
Berkeley Data Analytics Stack
»Next generation of data analytics stack with speed and functionality
More information: www.spark-project.org
Hands-on Tutorials: ampcamp.berkeley.edu
»Video tutorials, EC2 exercises
»AMP Camp 2 – August 29-30, 2013