2. Scenario
• Situation:
– You have hundreds of services producing logs
– You're running a daily cron job on the logs
• Rotating the logs
• Maybe compressing or otherwise processing them
• Transferring them to HDFS (the Hadoop Distributed File System)
• Problem:
– As the amount of data increases, it takes longer and longer to run the cron job
7/15/2010 2
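The daily job in this scenario might look like the crontab entry below; the script path, schedule, and log locations are illustrative, not from the deck:

```
# Hypothetical crontab entry: at 02:00 daily, rotate each service's logs,
# compress them, and copy the results into HDFS (e.g. via `hadoop fs -put`).
# All of that work grows with log volume, which is exactly the problem.
0 2 * * * /usr/local/bin/ship_logs_to_hdfs.sh >> /var/log/ship_logs.log 2>&1
```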
3. You need a "Flume"
• Flume is a distributed system that gets your logs from their source and aggregates them to where you want to process them
• Open source, Apache v2.0 License
• Goals:
– Reliability
– Scalability
– Extensibility
– Manageability
[Photo: Columbia Gorge, Broughton Log Flume]
4. Use cases
• Collecting logs from nodes in your Hadoop cluster
• Collecting logs from services such as httpd, mail, etc.
• Collecting impressions from custom apps for an ad network
• But wait, there's more!
"It's log, log ... Everyone wants a log!"
– Basic online in-stream analysis
– Online in-stream file processing and manipulation
5. Key abstractions
• Data path and control path
[Diagram: Agent and Collector nodes in the data path; Master in the control path]
• Nodes are in the data path
– Nodes have a source and a sink
– They can take different roles
• A typical topology has agent nodes and collector nodes
• Optionally it has processor nodes
• Masters are in the control path
– Centralized point of configuration
– Specify sources and sinks
– Can control flows of data between nodes
– Use one master or use many with a ZooKeeper-backed quorum
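In Flume's configuration language (as of the 0.9.x releases this deck describes), each node is mapped to a source and a sink; a minimal sketch, with hostnames, ports, and paths that are illustrative:

```
# A hypothetical agent node tails a log file and forwards events to a
# collector node, which writes them into HDFS.
agent1     : tail("/var/log/app/app.log") | agentSink("collector1", 35853) ;
collector1 : collectorSource(35853) | collectorSink("hdfs://namenode/flume/", "events-") ;
```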
8. Outline
• What is Flume?
– Goals and architecture
• Reliability
– Fault-tolerance and high availability
• Scalability
– Horizontal scalability of all nodes and masters
• Extensibility
– Unix principle; all kinds of data, all kinds of sources, all kinds of sinks
• Manageability
– Centralized management supporting dynamic reconfiguration
9. RELIABILITY
The logs will still get there…
10. Tunable data reliability levels
[Diagram: Agent → Collector → HDFS pipeline, shown for each level]
• Best effort
– Fire and forget
• Store on failure + retry
– Local acks, local errors detectable
– Failover when faults detected
• End-to-end reliability
– End-to-end acks
– Data survives compound failures, and may be retried multiple times
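The three levels above map onto different agent sinks in the 0.9.x configuration language; a sketch, assuming a collector named `collector1` on its default port:

```
agentBE  : tail("/var/log/app.log") | agentBESink("collector1") ;   # best effort
agentDFO : tail("/var/log/app.log") | agentDFOSink("collector1") ;  # store (disk) on failure + retry
agentE2E : tail("/var/log/app.log") | agentE2ESink("collector1") ;  # end-to-end acks
```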
13. Data path is horizontally scalable
[Diagram: several agents fan in to a collector that writes to HDFS]
• Add collectors to increase availability and to handle more data
– Assumes a single agent will not dominate a collector
– Fewer connections to HDFS
– Larger, more efficient writes to HDFS
• Agents have mechanisms for machine resource tradeoffs
– Write log locally to avoid collector disk IO bottleneck and catastrophic failures
– Compression and batching (trade CPU for network)
– Push computation into the event collection pipeline (balance IO, memory, and CPU resource bottlenecks)
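The compression-and-batching tradeoff can be expressed with sink decorators in the 0.9.x configuration language; a sketch with illustrative values:

```
# Batch 100 events at a time and gzip each batch before sending to the
# collector, trading agent CPU for network bandwidth.
agent1 : tail("/var/log/app.log")
       | { batch(100) => { gzip => agentDFOSink("collector1") } } ;
```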
14. Load balancing
[Diagram: agents logically partitioned across multiple collectors]
• Agents are logically partitioned and can send to different collectors
• Use randomization to pre-specify failovers when many collectors exist
• Spread load if a collector goes down
• Spread load if new collectors are added to the system
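Flume 0.9.x can generate these randomized failover chains automatically; a sketch using the auto-chain sinks, whose collector list is maintained by the master (sink names per the 0.9.x user guide, the rest illustrative):

```
# Each agent gets a randomized failover ordering over the registered
# collectors, so load spreads when a collector dies or a new one joins.
agentA : tail("/var/log/app.log") | autoDFOChain ;
agentB : tail("/var/log/app.log") | autoE2EChain ;
```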
16. Control plane is horizontally scalable
[Diagram: nodes talking to multiple masters, backed by ZooKeeper members ZK1, ZK2, ZK3]
• A master controls dynamic configurations of nodes
– Uses a consensus protocol to keep state consistent
– Scales well for configuration reads
– Allows for adaptive repartitioning in the future
• Nodes can talk to any master
• Masters can talk to any ZooKeeper member
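A multi-master setup is declared in Flume's site configuration; a minimal sketch, assuming the 0.9.x property name `flume.master.servers` and illustrative hostnames:

```
<!-- flume-site.xml (sketch): nodes can contact any of the listed masters,
     which keep their shared state in a ZooKeeper-backed store. -->
<property>
  <name>flume.master.servers</name>
  <value>master1,master2,master3</value>
</property>
```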
19. EXTENSIBILITY
Turn raw logs into something useful…
20. Flume is easy to extend
• Simple source and sink APIs
– Event-granularity streaming design
– Have many simple operations and compose them for complex behavior
• End-to-end principle
– Put smarts and state at the end points; keep the middle simple
• Flume deals with reliability
– Just add a new source or a new sink, and Flume has primitives to deal with reliability
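To make "simple sink API" concrete, here is a sketch of a custom sink against the 0.9.x plugin interface. The package and base-class names follow the Flume OG plugin docs but should be treated as assumptions, the method signatures are simplified, and the snippet needs the Flume jars to compile:

```java
// Sketch only: a sink that counts events. Flume calls open() once,
// append() once per event, and close() on shutdown; reliability (acks,
// retries) is handled by Flume's surrounding primitives, not by the sink.
import com.cloudera.flume.core.Event;
import com.cloudera.flume.core.EventSink;

public class CountingSink extends EventSink.Base {
  private long count;

  public void open() { count = 0; }
  public void append(Event e) { count++; }
  public void close() { System.out.println("saw " + count + " events"); }
}
```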
21. Variety of data sources
• Can deal with push and pull sources
[Diagram: apps that push to an agent, are polled by an agent, or embed an agent]
• Supports many legacy event sources
– Tailing a file
– Output from a periodically exec'ed program
– Syslog, syslog-ng
– Experimental: IRC / Twitter / Scribe / AMQP
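A few of the legacy sources above, in 0.9.x configuration syntax (ports, paths, and the polled command are illustrative):

```
agent1 : tail("/var/log/httpd/access_log") | agentDFOSink("collector1") ;  # tail a file
agent2 : syslogUdp(5140)                   | agentDFOSink("collector1") ;  # receive syslog
agent3 : exec("/usr/local/bin/poll_stats") | agentDFOSink("collector1") ;  # exec'ed program output
```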
22. Variety of data output
• Send data to many sinks
– HDFS, files, console, RPC
– Experimental: HBase, Voldemort, S3, etc.
• Supports an extensible variety of output formats and destinations
– Output to language-neutral and open data formats (JSON, Avro, text)
– Compressed output files in development
• Uses decorators to process event data in-flight
– Sampling, attribute extraction, filtering, projection, checksumming, batching, wire compression, etc.
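Destination and rolling behavior are chosen in the sink; a sketch of a collector writing time-bucketed files into HDFS (bucket-path escapes per the 0.9.x user guide, values illustrative; the on-disk format, e.g. text or Avro JSON, is separately configurable):

```
# Roll output files every 30 seconds into per-day HDFS directories.
collector1 : collectorSource(35853)
           | collectorSink("hdfs://namenode/flume/%Y-%m-%d/", "events-", 30000) ;
```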
24. Centralized data flow management
• Master specifies node sources, sinks, and data flows
– Simply specify the role of the node: collector, agent
– Or specify a custom configuration for a node
• Control interfaces:
– Flume Shell
– Basic web interface
– HUE + Flume Manager App (Enterprise users)
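A sketch of a Flume Shell session driving the master (command names per the 0.9.x documentation; hostnames and paths illustrative):

```
$ flume shell
[flume] connect master1:35873
[flume] exec config agent1 'tail("/var/log/app.log")' 'agentDFOSink("collector1")'
[flume] exec config collector1 'collectorSource(35853)' 'collectorSink("hdfs://namenode/flume/", "events-")'
[flume] getnodestatus
```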
26. For advanced users
• A concise and precise configuration language for specifying arbitrary data paths
– Dataflows are essentially DAGs
– Control specific event flows
• Enable durability mechanisms and failover mechanisms
• Tune the parameters of these mechanisms
– Dynamic updates of configurations
• Allows for live failover changes
• Allows for handling newly provisioned machines
• Allows for changing analytics
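The data-path language can wire up arbitrary DAGs; a sketch of a two-hop flow with an intermediate processing tier (all node names, ports, and decorator values illustrative):

```
# Agent -> processor -> collector: the processor tier batches events
# before they reach the collector that writes to HDFS.
agent1     : tail("/var/log/app.log") | agentE2ESink("proc1") ;
proc1      : collectorSource(35853) | { batch(100) => agentE2ESink("collector1", 35854) } ;
collector1 : collectorSource(35854) | collectorSink("hdfs://namenode/flume/", "events-") ;
```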
28. Summary
• Flume is a distributed, reliable, scalable system for collecting and delivering high-volume continuous event data such as logs
– Tunable data reliability levels
– Reliable master backed by ZooKeeper
– Writes data to HDFS into buckets ready for batch processing
– Dynamically configurable nodes
– Simplified, automated management for agent+collector topologies
• Open source, Apache v2.0 license