Twitter generates billions of events per day, and analyzing these events in real time presents a massive challenge. To meet this challenge, Twitter designed an end-to-end real-time stack consisting of DistributedLog, a distributed and replicated messaging system, and Heron, a streaming system for real-time computation. DistributedLog is a replicated log service built on top of Apache BookKeeper, providing infinite, ordered, append-only streams that can be used to build robust real-time systems; it is the foundation of Twitter's publish-subscribe system. Twitter Heron is the next-generation streaming system, built from the ground up to address our scalability and reliability needs. Both systems have been in production for nearly two years and are widely used at Twitter in a range of diverse applications, such as the search ingestion pipeline, ad analytics, image classification, and more. These slides describe Heron and DistributedLog in detail, covering a few use cases in depth and sharing the experience and challenges of operating real-time systems at scale.
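The append-only log abstraction at the heart of DistributedLog can be illustrated with a minimal sketch. This is not the DistributedLog API; the class and method names are invented for illustration, and the real system makes these operations durable and replicated via Apache BookKeeper. The core idea: every record gets a monotonically increasing sequence number, and readers can tail the log in order from any position.

```python
# Toy sketch (not the DistributedLog API): an ordered, append-only log.
class AppendOnlyLog:
    def __init__(self):
        self._records = []

    def append(self, payload: bytes) -> int:
        """Append a record; return its sequence number."""
        self._records.append(payload)
        return len(self._records) - 1

    def read_from(self, seq: int):
        """Yield (seq, payload) pairs starting at `seq`, in order."""
        for i in range(seq, len(self._records)):
            yield i, self._records[i]

log = AppendOnlyLog()
log.append(b"event-1")
log.append(b"event-2")
print(list(log.read_from(0)))  # [(0, b'event-1'), (1, b'event-2')]
```

Because readers only ever see records in sequence order, many independent subscribers can replay or tail the same stream, which is what makes the log a good foundation for publish-subscribe.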
3. 3
Value of Data
It’s contextual
[Figure: Value of Data to Decision-Making (information half-life in decision-making). Value decays over time: real-time data supports preventive/predictive, time-critical decisions; data that is seconds to minutes old supports actionable and reactive decisions; data that is hours to days old supports historical analysis; traditional "batch" business intelligence operates on data that is days to months old.]
[1] Courtesy Michael Franklin, BIRTE, 2015.
4. 4
What is Real-Time? It's contextual:

Batch (high throughput, > 1 hour): monthly active users, relevance for ads, ad-hoc queries
OLTP (deterministic workflows, < 1 ms): financial trading
Real Time (low latency, approximate, 10 ms - 1 sec): ad impressions count, hashtag trends
Near Real Time (latency sensitive, < 500 ms): fanout Tweets, search for Tweets
5. 5
Why Real Time?

Real-time trends: emerging breakout trends on Twitter (in the form of #hashtags)
Real-time conversations: real-time sports conversations related to a topic (a recent goal or touchdown)
Real-time recommendations: real-time product recommendations based on your behavior and profile
Real-time search: real-time search of Tweets
ANALYZING BILLIONS OF EVENTS IN REAL TIME IS A CHALLENGE!
7. 7
Real Time Use Cases

Online services (10s of ms): transaction log, queues, RPCs
Near real time (100s of ms): change propagation, streaming analytics
Data for batch analytics (secs to mins): log aggregation, client events
9. 9
Scribe

Open source log aggregation: originally from Facebook; Twitter made significant enhancements for real-time event aggregation
High throughput and scale: delivers 125M messages/min and provides tight SLAs on data reliability
Runs on every machine: simple, very reliable, and uses memory and CPU efficiently
10. Event Bus & DistributedLog
Next Generation Messaging
12. 12
Kestrel Limitations

Adding subscribers is expensive
Scales poorly as the number of queues increases
Durability is hard to achieve
Read-behind degrades performance: too many random I/Os
Cross-DC replication
13. 13
Kafka Limitations

Relies on the file system page cache
Performance degrades when subscribers fall behind: too much random I/O
14. 14
Rethinking Messaging

Durable writes, intra-cluster and geo-replication
Scale resources independently
Cost efficiency
Unified stack: tradeoffs for various workloads
Multi-tenancy
Ease of manageability
15. 15
Event Bus

Durable writes, intra-cluster and geo-replication
Scale resources independently
Cost efficiency
Unified stack: tradeoffs for various workloads
Multi-tenancy
Ease of manageability
23. 23
Twitter Heron: Design Goals

Batching of tuples: amortizing the cost of transferring tuples
Task isolation: ease of debuggability, isolation, and profiling
Fully API compatible with Storm: directed acyclic graph of topologies, spouts, and bolts
Support for back pressure: topologies should be self-adjusting
Use of mainstream languages: C++, Java, and Python
Efficiency: reduced resource consumption
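The back-pressure goal can be illustrated with a toy sketch. This is not Heron's actual mechanism (which coordinates across processes in a running topology); it only shows the principle: a bounded buffer between stages forces the upstream stage to stop emitting when the downstream stage falls behind, instead of dropping tuples or exhausting memory.

```python
# Toy sketch of back pressure: a deliberately small bounded buffer
# between an upstream "spout" loop and a downstream "bolt".
from queue import Queue, Full

buffer = Queue(maxsize=3)  # small on purpose

sent, deferred = [], []
for tuple_id in range(5):
    try:
        buffer.put_nowait(tuple_id)
        sent.append(tuple_id)
    except Full:
        # Back pressure: stop emitting until the downstream drains.
        deferred.append(tuple_id)

print(sent)      # [0, 1, 2]
print(deferred)  # [3, 4]
```

The self-adjusting behavior on the slide is this idea applied topology-wide: when any bolt slows down, pressure propagates upstream until the spouts reduce their emit rate.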
25. 25
Heron Terminology

Topology: a directed acyclic graph; vertices = computation, edges = streams of data tuples
Spouts: sources of data tuples for the topology (examples: Kafka, Kestrel, MySQL, Postgres)
Bolts: process incoming tuples and emit outgoing tuples (examples: filtering, aggregation, join, any function)
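The spout/bolt/topology terminology can be sketched with plain Python generators. This is not the Storm/Heron API; the function names are invented for illustration, and a real topology runs each stage as parallel tasks.

```python
# Toy sketch of the spout -> bolt -> bolt model as a simple DAG.
from collections import Counter

def sentence_spout():
    """Spout: a source of data tuples (here, sentences)."""
    for s in ["heron processes tuples", "tuples flow through bolts"]:
        yield s

def split_bolt(sentences):
    """Bolt: processes incoming tuples, emits outgoing tuples (words)."""
    for sentence in sentences:
        yield from sentence.split()

def count_bolt(words):
    """Bolt: aggregates the word stream into running counts."""
    return Counter(words)

# Topology: wire the stages into a directed acyclic graph.
counts = count_bolt(split_bolt(sentence_spout()))
print(counts["tuples"])  # 2
```

In a real topology each stage runs with its own parallelism and tuples flow continuously, but the dataflow shape is the same.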
27. 27
Stream Groupings

01 Shuffle grouping: random distribution of tuples
02 Fields grouping: group tuples by one or more fields
03 All grouping: replicates tuples to all tasks
04 Global grouping: sends the entire stream to one task
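The four groupings can be sketched as routing functions that map a tuple to the set of downstream task indices. This is a toy illustration, not Heron's implementation; `NUM_TASKS` and the dict-shaped tuple are assumptions made for the example.

```python
# Toy sketch: each grouping decides which downstream task(s) get a tuple.
import random

NUM_TASKS = 3

def shuffle_grouping(tup):
    """Shuffle: random distribution of tuples across tasks."""
    return [random.randrange(NUM_TASKS)]

def fields_grouping(tup, field="user"):
    """Fields: the same field value always lands on the same task."""
    return [hash(tup[field]) % NUM_TASKS]

def all_grouping(tup):
    """All: replicate the tuple to every task."""
    return list(range(NUM_TASKS))

def global_grouping(tup):
    """Global: send the entire stream to a single task."""
    return [0]

t = {"user": "alice", "action": "click"}
assert fields_grouping(t) == fields_grouping(t)  # deterministic per key
assert all_grouping(t) == [0, 1, 2]
assert global_grouping(t) == [0]
```

Fields grouping is what makes per-key aggregations (e.g., counts per hashtag) correct under parallelism: every tuple for a given key is processed by the same task.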
42. 42
Lambda Architecture - The Good

Scribe → Event Bus (collection pipeline) → Heron (analytics pipeline) → Results
43. 43
Lambda Architecture - The Bad

Have to fix everything (maybe twice)!
How much duct tape is required?
Have to write everything twice!
Subtle differences in semantics
What about graphs, ML, SQL, etc.?
44. 44
Summingbird to the Rescue

Summingbird program → Scalding/MapReduce → HDFS → batch key-value result store
Summingbird program → message broker → Heron topology → online key-value result store
Client: merges the batch and online key-value result stores
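The Summingbird idea can be sketched in Python (Summingbird itself is a Scala API; this toy uses `Counter` as the monoid and invented variable names): the aggregation logic is written once, executed over both the batch dataset and the live stream, and the client merges the two result stores at read time.

```python
# Toy sketch of write-once, run-twice aggregation over a monoid.
from collections import Counter

def aggregate(events):
    """The single program: count events per key (Counter is a monoid)."""
    return Counter(events)

batch_events  = ["#heron", "#storm", "#heron"]  # e.g. historical data on HDFS
stream_events = ["#heron", "#kafka"]            # e.g. live data from a broker

batch_store  = aggregate(batch_events)   # Scalding/MapReduce path
online_store = aggregate(stream_events)  # Heron topology path

# Client view: merge the batch and online stores with the monoid's '+'.
merged = batch_store + online_store
print(merged["#heron"])  # 3
```

Because the same function runs on both paths and the values form a monoid, the merge is well-defined and the "write everything twice / subtle semantic differences" problems of hand-built lambda architectures go away.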
45. 45
Curious to Learn More?

Twitter Heron: Stream Processing at Scale
Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, Siddarth Taneja
Twitter, Inc., and University of Wisconsin - Madison

Storm @Twitter
Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, Dmitriy Ryaboy
Twitter, Inc., and University of Wisconsin - Madison
46. 46
Interested in Heron?
CONTRIBUTIONS ARE WELCOME!
https://github.com/twitter/heron
http://heronstreaming.io
HERON IS OPEN SOURCED
FOLLOW US @HERONSTREAMING
47. 47
Interested in Distributed Log?
CONTRIBUTIONS ARE WELCOME!
https://github.com/twitter/distributedlog
http://distributedlog.io
DISTRIBUTED LOG IS OPEN SOURCED
FOLLOW US @DISTRIBUTEDLOG