Summingbird:
streaming portable map-reduce
Oscar Boykin | Twitter | @posco | @summingbird
What is summingbird?
1) Model for streaming multi-stage map-reduce
2) Implementations to run this model on Storm, Hadoop, Spark and soon
   => Portable
3) Systematic implementation of the “Lambda Architecture”
   => Fault Tolerant
What is streaming map-reduce?
[Diagram: topology of two Sources, two Maps, a Lookup Service, a Merge, and a SumByKey]
We can push single data objects from either of the sources all the way through the topology => conceptually, state can be updated incrementally.
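A minimal sketch of such a topology in plain Python, not Summingbird's API. The wiring (one Map enriched by the lookup service, both Maps merged into one SumByKey), the event fields, and the names `LOOKUP`, `map_clicks`, `map_views` and `SumByKey` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical lookup service: user id -> country (illustrative data only).
LOOKUP = {"u1": "US", "u2": "BR"}

def map_clicks(event):
    # Map stage for the first source: enrich via the lookup service, emit (key, value).
    return [(LOOKUP.get(event["user"], "unknown"), 1)]

def map_views(event):
    # Map stage for the second source: the event already carries its key.
    return [(event["country"], 1)]

class SumByKey:
    """Keeps one running total per key, updated one event at a time."""
    def __init__(self):
        self.state = defaultdict(int)

    def push(self, pairs):
        for key, value in pairs:
            self.state[key] += value

store = SumByKey()

# Merge: events from either source may arrive in any interleaving.
stream = [
    ("clicks", {"user": "u1"}),
    ("views", {"country": "BR"}),
    ("clicks", {"user": "u2"}),
    ("clicks", {"user": "u9"}),
]
for source, event in stream:
    mapper = map_clicks if source == "clicks" else map_views
    store.push(mapper(event))  # state is updated incrementally, per event

totals = dict(store.state)
print(totals)
```

Each event travels through the whole topology on its own, so the totals are current after every push.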
Why do I want this?
1) If our model assumes streaming, one-at-a-time semantics, we can run this code in realtime (e.g. Storm) or in offline/batch (e.g. Hadoop, Tez, Spark).
Again: Summingbird is a portability and abstraction layer
Summingbird allows you to write your job logic once, and change the backend as needed. Go from batch to realtime, from Storm to Spark Streaming (eventually), from Hadoop to Spark, from Spark to Tez (soon).
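To illustrate "write once, change the backend", here is a hedged sketch in plain Python. The two runner functions are hypothetical stand-ins for a batch and a realtime platform; they are not Summingbird code. The job logic (`job_flat_map`) is defined once and reused unchanged by both.

```python
from collections import Counter

def job_flat_map(line):
    # Job logic, written once: split a line into (word, 1) pairs.
    return [(word.lower(), 1) for word in line.split()]

def run_batch(lines):
    # Hypothetical batch backend: sees the whole input, maps it, then reduces.
    pairs = [pair for line in lines for pair in job_flat_map(line)]
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return dict(totals)

def run_streaming(lines):
    # Hypothetical realtime backend: one event at a time, state updated in place.
    totals = {}
    for line in lines:
        for word, n in job_flat_map(line):
            totals[word] = totals.get(word, 0) + n
    return totals

lines = ["to be or not to be", "Be portable"]
batch_result = run_batch(lines)
streaming_result = run_streaming(lines)
print(batch_result == streaming_result)
```

Because the job only assumes one-at-a-time semantics, both runners reach the same totals.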
2) We have optimizers at the summingbird layer, and leverage those optimizers across platforms (combining joins, map-side combiners, data-cubing optimizations).
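As one example, a map-side combiner pre-sums values per key on each mapper before anything is sent to the reducers. The sketch below is plain Python with made-up data and a hypothetical `sum_by_key` helper; it shows the idea only and does not reproduce Summingbird's optimizer.

```python
from collections import defaultdict

def sum_by_key(pairs):
    # Sum values per key; used both as the combiner and as the final reducer.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Output of two hypothetical mappers (illustrative data only).
mapper_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

# Without a combiner, every pair crosses the network.
shuffled_plain = [pair for out in mapper_outputs for pair in out]

# With a map-side combiner, each mapper sends one partial sum per key.
shuffled_combined = [
    pair for out in mapper_outputs for pair in sum_by_key(out).items()
]

plain_result = sum_by_key(shuffled_plain)
combined_result = sum_by_key(shuffled_combined)
print(len(shuffled_plain), len(shuffled_combined), plain_result == combined_result)
```

The result is unchanged because addition is associative, while fewer records are shuffled.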
3) If we restrict our reduce operators to a very general class, we can automatically build a lambda architecture system.
What is the Lambda Architecture?
Lambda Architecture. @nathanmarz
http://lambda-architecture.net
But how do you build a lambda architecture?
All Hail the Monoid (associative operator)
1 + 2 + 3 = 6
1 + (2 + 3) = 1 + 5 = 6
(1 + 2) + 3 = 3 + 3 = 6
Example Monoids
• (a min b) min c = a min (b min c)
• (a max b) max c = a max (b max c)
• (a or b) or c = a or (b or c)
• addition: (a + b) + c = a + (b + c)
• set union: (a ∪ b) ∪ c = a ∪ (b ∪ c)
• set intersection: (a ∩ b) ∩ c = a ∩ (b ∩ c)
• harmonic sum: 1/(1/a + 1/b)
• approximate unique count (HLL), approximate counter (CMS)
• and vectors: [a1, a2] max [b1, b2] = [a1 max b1, a2 max b2]
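A few of these can be written down directly. The sketch below uses a small hypothetical `Monoid` class (an identity element `zero` plus an associative `plus`) in plain Python. It is not the Algebird API, and the names are chosen for illustration.

```python
from functools import reduce

class Monoid:
    """An associative binary operation together with its identity element."""
    def __init__(self, zero, plus):
        self.zero = zero
        self.plus = plus

    def sum(self, items):
        # Fold the items with plus, starting from the identity.
        return reduce(self.plus, items, self.zero)

int_add = Monoid(0, lambda a, b: a + b)
num_min = Monoid(float("inf"), min)
num_max = Monoid(float("-inf"), max)
bool_or = Monoid(False, lambda a, b: a or b)
set_union = Monoid(frozenset(), lambda a, b: a | b)
# Vectors: combine element-wise, here with max as on the slide.
vector_max = Monoid(
    (float("-inf"), float("-inf")),
    lambda a, b: tuple(max(x, y) for x, y in zip(a, b)),
)

def is_associative(monoid, a, b, c):
    # (a + b) + c must equal a + (b + c).
    left = monoid.plus(monoid.plus(a, b), c)
    right = monoid.plus(a, monoid.plus(b, c))
    return left == right

print(int_add.sum([1, 2, 3]))
print(vector_max.plus((1, 5), (3, 2)))
print(is_associative(num_min, 4, 2, 9))
```

Any value type with such an operation can be grouped and regrouped freely, which is what the batching scheme on the following slides relies on.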
Batching and associativity yield reliability
[Diagram: the Log is split into Batch0, Batch1, Batch2, Batch3. A fault-tolerant Hadoop job runs over each batch; a noisy realtime (RT) job also runs over each batch.]
• Realtime sums from 0, each batch.
• Hadoop keeps a total sum (reliably).
• Sum of RT Batch(i) + Hadoop Batch(i-1) has bounded noise, bounded read/write size. Done at query time.
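The query-time merge can be sketched as follows. This is a plain Python illustration with made-up numbers; `hadoop_totals` and `query` are hypothetical helper names, and the dropped event in the last realtime batch is simulated by hand.

```python
def hadoop_totals(log_batches, completed):
    # Offline layer: reliable running total through each completed batch.
    totals, running = [], 0
    for batch in log_batches[:completed]:
        running += sum(batch)
        totals.append(running)
    return totals

def query(batch_totals, rt_sums, i):
    # Total through batch i = Hadoop total through batch i-1 + RT sum of batch i.
    previous = batch_totals[i - 1] if i > 0 else 0
    return previous + rt_sums[i]

log_batches = [[1, 2], [3], [4, 5], [6, 7]]

# Hadoop has finished batches 0..2; batch 3 is still in flight.
batch_totals = hadoop_totals(log_batches, completed=3)

# Realtime sums start from 0 in each batch. Here the realtime layer
# dropped the event "7" in batch 3, so its sum is 6 instead of 13.
rt_sums = [3, 3, 9, 6]

answer = query(batch_totals, rt_sums, 3)
exact = sum(sum(batch) for batch in log_batches)
print(answer, exact)
```

The error is confined to the one batch Hadoop has not yet processed, and each query reads only two values. Once Hadoop completes batch 3, its total replaces the noisy realtime sum.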
Lambda Architecture with Summingbird and Storehaus
[Diagram components: Kafka, Summingbird-scalding, Summingbird-storm, storehaus-memcache, storehaus-hbase, storehaus-algebra]
What has Twitter built with this?
* realtime dashboards: ads, operations, publishers.
* stream transformation: filtering, mapping, joining, then exporting.
* building realtime features for ML models.
* top-K applications: most viewed, most clicked, etc.
[Diagram: Tweets -> (Flat)Mappers (f) -> HDFS/Queue -> groupBy tweetid -> Reducers (+) -> HDFS/Queue]
Mapper output: [(tweetid, CMS(domain -> 1)), (0, CMS(tweetid -> 1))]
reduce: (x, y) => sum CMS tables (x, y)
• The CMS is fixed size, so it never blows up.
• delta = 1%, eps = 0.1% gives table size ~5000.
• Can query any (tweetid, 0 == all) for counts.
• Can simultaneously keep track of the keys with the highest counts (heavy-hitters).
• Using heavy-hitters, you can see top embedded tweets.
• Add a time-bucket to the key for keeping history.
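The job above can be sketched with a toy Count-Min Sketch in plain Python. The `CMS` class, the SHA-256 based hashing, and the `WIDTH`/`DEPTH` values are illustrative assumptions; they are not Algebird's implementation and are not derived from the delta and eps quoted above.

```python
import hashlib

WIDTH, DEPTH = 1000, 5  # illustrative table dimensions (assumed)

class CMS:
    """Toy Count-Min Sketch: a fixed DEPTH x WIDTH table of counters."""
    def __init__(self, table=None):
        self.table = table or [[0] * WIDTH for _ in range(DEPTH)]

    @staticmethod
    def _col(row, item):
        # One hash per row, derived from SHA-256 of "row:item".
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % WIDTH

    @classmethod
    def of(cls, item):
        # CMS(item -> 1): a sketch holding a single count for item.
        cms = cls()
        for row in range(DEPTH):
            cms.table[row][cls._col(row, item)] += 1
        return cms

    def plus(self, other):
        # Monoid sum: add the tables cell by cell; the size never grows.
        return CMS([[a + b for a, b in zip(ra, rb)]
                    for ra, rb in zip(self.table, other.table)])

    def estimate(self, item):
        # Count-min estimate: never below the true count.
        return min(self.table[row][self._col(row, item)] for row in range(DEPTH))

# (tweetid, embedding domain) events; made-up data.
events = [(101, "a.com"), (101, "b.com"), (101, "a.com"), (202, "a.com")]

state = {}
for tweetid, domain in events:
    # flatMap: one record per tweet, one under key 0 for "all tweets".
    for key, cms in [(tweetid, CMS.of(domain)), (0, CMS.of(tweetid))]:
        state[key] = state[key].plus(cms) if key in state else cms

print(state[101].estimate("a.com"), state[0].estimate(101))
```

Key 101 answers "which domains embed this tweet", while key 0 answers "which tweets are embedded most".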
Review: @Summingbird is:
1) Portability/Optimization layer: write once, run on many platforms
2) Systematic implementation of Lambda Architecture: easy fault tolerance, no design needed.
3) Real-world & high throughput.
Resources
twitter: @summingbird
mail: summingbird@groups.google.com
irc: freenode/#summingbird
github.com/twitter/summingbird
Join us!
Twitter is hiring people to use and develop @scalding
and @summingbird to build realtime analytics and ML.
twitter: @posco
email: oscar at twitter
Thank you!
