4. About Me
• Steve Hoffman
• twitter: @bacoboy
else: http://bit.ly/bacoboy
• Tech Guy @Orbitz
• Wrote a book on Flume
5. Why do I need Flume?
• Created to deal with streaming data/logs to HDFS
• Can’t mount HDFS (usually)
• Can’t “copy” files to HDFS if the files aren’t closed yet (e.g. live log files)
• Need to buffer “some”, then write and close a file, then repeat
• May involve multiple hops due to topology (# of machines, datacenter separation, etc.)
• A lot can go wrong here…
6. Agent
• Java daemon
• Has a name (usually ‘agent’)
• Receives data from sources and writes events to 1 or more channels
• Moves events from a channel to a sink; removes them from the channel only if successfully written
8. Channels
• Place to hold Events
• Memory or File Backed (also JDBC, but why?)
• Bounded - Size is configurable
• Resources aren’t infinite
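A hedged sketch of the two common channel types (the paths and capacity value are illustrative assumptions, not recommendations):

# in-memory: fast, but events are lost if the agent dies
agent.channels.c1.type = memory
agent.channels.c1.capacity = 10000

# file-backed: survives restarts, at the cost of disk I/O
agent.channels.c2.type = file
agent.channels.c2.checkpointDir = /var/flume/checkpoint
agent.channels.c2.dataDirs = /var/flume/data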
9. Sources
• Feeds data to one or more Channels
• Usually data is pushed to it (listening for data on a socket, e.g. HTTP Source, or receiving from the Avro log4j appender)
• Or it can periodically poll another system and generate Events (e.g. run a command every minute and parse the output into Events, query a DB/Mongo/etc.) — examples of both styles below
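A hedged sketch of each style (the port number and script name are made-up assumptions):

# push style: listen for events posted over HTTP
agent.sources.r1.type = http
agent.sources.r1.port = 44444
agent.sources.r1.channels = c1

# poll style: run a command and turn each line of output into an Event
agent.sources.r2.type = exec
agent.sources.r2.command = /usr/local/bin/dump-stats.sh
agent.sources.r2.channels = c1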
10. Sinks
• Moves Events from a single Channel to a destination
• Only removes an Event from the Channel if the write succeeds
• HDFSSink is the one you’ll use the most (most likely…)
16. Startup
# Agent named 'agent'
# Input (source)
agent.sources.r1.type = seq
agent.sources.r1.channels = c1

# Output (sink)
agent.sinks.k1.type = logger
agent.sinks.k1.channel = c1

# Channel
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000

# Wire everything together
agent.sources = r1
agent.sinks = k1
agent.channels = c1
At startup the agent reads name.{sources|sinks|channels} to find each instance name and type, connects the channel(s), then applies the type-specific configuration.
RTM - Flume User Guide
https://flume.apache.org/FlumeUserGuide.html
or my book :)
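Running the agent with this config (flags per the Flume User Guide; the file name flume.conf is an assumption) produces logs like the next slide:

$ bin/flume-ng agent --conf conf --conf-file flume.conf --name agent -Dflume.root.logger=INFO,console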
17. Configuration Sample (logs)
Creating channels
Creating instance of channel c1 type memory
Created channel c1
Creating instance of source r1, type seq
Creating instance of sink: k1, type: logger
Channel c1 connected to [r1, k1]
Starting new configuration:{ sourceRunners:{r1=PollableSourceRunner: { source:org.apache.flume.source.SequenceGeneratorSource{name:r1,state:IDLE} counterGroup:{ name:null counters:{} } }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@19484a05 counterGroup:{ name:null counters:{} } }} channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} }
Event: { headers:{} body: 30 0 }
Event: { headers:{} body: 31 1 }
Event: { headers:{} body: 32 2 }
and so on…
18. Using Cloudera Manager
• Same stuff, just in a GUI
• Centrally managed in a database (instead of source control/Git)
• Distributed from a central location (instead of Chef/Puppet)
20. Channel Selector
• When more than 1 channel specified on Source
• Replicating (Each channel gets a copy) - default
• Multiplexing (Channel picked based on a header
value)
• Custom (If these don’t work for you - code one!)
21. Channel Selector
Replicating
• Copy sent to all channels associated with Source
agent.sources.r1.selector.type=replicating
agent.sources.r1.channels=c1 c2 c3
• Can specify “optional” channels
agent.sources.r1.selector.optional=c3
• Transaction success if all non-optional channels
take the event (in this case c1 & c2)
22. Channel Selector
Multiplexing
• Copy sent to only some of the channels
agent.sources.r1.selector.type=multiplexing
agent.sources.r1.channels=c1 c2 c3 c4
• Switch based on header key
(i.e. {“currency”:“USD”} → c1)
agent.sources.r1.selector.header=currency
agent.sources.r1.selector.mapping.USD=c1
agent.sources.r1.selector.mapping.EUR=c2 c3
agent.sources.r1.selector.default=c4
23. Interceptors
• Zero or more on Source (before written to channel)
• Zero or more on Sink (after read from channel)
• Or Both
• Use for transformations of data in-flight (headers OR body)
public Event intercept(Event event);
public List<Event> intercept(List<Event> events);
• Return null or empty List to drop Events
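A hedged sketch of a custom Interceptor; the class name and the header logic are invented for illustration, but the interface and its Builder are the real org.apache.flume.interceptor types:

import java.util.Iterator;
import java.util.List;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Hypothetical example: drop Events with empty bodies,
// tag everything else with an "app" header.
public class AppTaggingInterceptor implements Interceptor {
  @Override public void initialize() {}

  @Override
  public Event intercept(Event event) {
    if (event.getBody() == null || event.getBody().length == 0) {
      return null; // returning null drops the Event
    }
    event.getHeaders().put("app", "checkout"); // invented header value
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    Iterator<Event> it = events.iterator();
    while (it.hasNext()) {
      if (intercept(it.next()) == null) {
        it.remove(); // drop it from the batch
      }
    }
    return events;
  }

  @Override public void close() {}

  // Flume instantiates interceptors through a nested Builder
  public static class Builder implements Interceptor.Builder {
    @Override public Interceptor build() { return new AppTaggingInterceptor(); }
    @Override public void configure(Context context) {}
  }
}

It would be wired in with agent.sources.r1.interceptors.i1.type=com.example.AppTaggingInterceptor$Builder (Flume takes the fully qualified Builder class name).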
24. Interceptor Chaining
• Processed in Order Listed in Configuration (source r1 example):
agent.sources.r1.interceptors=i1 i2 i3
agent.sources.r1.interceptors.i1.type=timestamp
agent.sources.r1.interceptors.i1.preserveExisting=true
agent.sources.r1.interceptors.i2.type=static
agent.sources.r1.interceptors.i2.key=datacenter
agent.sources.r1.interceptors.i2.value=CHI
agent.sources.r1.interceptors.i3.type=host
agent.sources.r1.interceptors.i3.hostHeader=relay
agent.sources.r1.interceptors.i3.useIP=false
• Resulting Headers added before writing to Channel:
{“timestamp”:“1392350333234”,
“datacenter”:“CHI”,
“relay”:“flumebox.example.com”}
25. Morphlines
• Interceptor and Sink forms.
• See Cloudera Website/Blog
• Created to ease transforms and Cloudera Search/Flume integration.
• An example:
# convert the timestamp field to "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
# The input may match one of "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
# or "yyyy-MM-dd'T'HH:mm:ss" or "yyyy-MM-dd".
convertTimestamp {
field : timestamp
inputFormats : ["yyyy-MM-dd'T'HH:mm:ss.SSS'Z'",
"yyyy-MM-dd'T'HH:mm:ss", "yyyy-MM-dd"]
inputTimezone : America/Chicago
outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
outputTimezone : UTC
}
26. Avro
• Apache Avro - Data Serialization
• http://avro.apache.org/
• Storage Format and Wire Protocol
• Self-Describing (schema written with the data)
• Supports compression of data (per block, not of the whole container, so it stays MapReduce friendly, i.e. “splittable”)
• Binary friendly: doesn’t require records to be separated by \n
27. Avro Source/Sink
• Preferred inter-agent transport in Flume
• Simple Configuration (host + port for sink and port for source)
• Minimal transformation needed for Flume Events
• Versions of Avro in client and server don’t need to match; only payload versioning matters (think protocol buffers vs. Java serialization)
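A hedged sketch of wiring two agents together (the hostname and port are made-up assumptions):

# on the sending agent: Avro sink pointing at the collector tier
agent.sinks.k1.type = avro
agent.sinks.k1.hostname = collector.example.com
agent.sinks.k1.port = 4545
agent.sinks.k1.channel = c1

# on the receiving agent: Avro source listening on that port
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sources.r1.channels = c1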
29. log4j Avro Sink
• Remember that Web Server pushing data to a Source?
• Use the Flume Avro log4j appender!
• log level, category, etc. become headers in Event
• “message” String becomes the body
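A hedged log4j.properties sketch (host and port are assumptions; the appender class ships with Flume):

# send application logs straight to a Flume Avro Source
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = flume.example.com
log4j.appender.flume.Port = 41414
log4j.rootLogger = INFO, flume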
31. Avro Client
• Send data to AvroSource from command line
• Run the flume-ng program with the avro-client parameter instead of agent
$ bin/flume-ng avro-client -H server.example.com -p 12345 [-F input_file]
• Each line of the file (or stdin if no file is given) becomes an Event
• Useful for testing or injecting data from outside Flume sources (ExecSource vs. a cron job that pipes output to avro-client)
32. HDFSSink
• Reads from a Channel and writes to a file in HDFS in chunks
• Until 1 of 3 things happens:
• some amount of time elapses (rollInterval)
• some number of records have been written (rollCount)
• some size of data has been written (rollSize)
• Then it closes that file and starts a new one (config sketch below)
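A hedged sketch (the path and roll values are illustrative assumptions, not recommendations):

agent.sinks.k1.type = hdfs
agent.sinks.k1.channel = c1
agent.sinks.k1.hdfs.path = hdfs://NN/data/%Y/%m/%d/%H
# hide in-progress files from MapReduce (see the HDFS writing slide)
agent.sinks.k1.hdfs.inUsePrefix = _
# roll on whichever triggers first: 5 minutes, 10k events, or ~128 MB
agent.sinks.k1.hdfs.rollInterval = 300
agent.sinks.k1.hdfs.rollCount = 10000
agent.sinks.k1.hdfs.rollSize = 134217728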
34. HDFS writing…
drwxr-x---   - flume flume    0 2014-02-16 17:04 /data/2014/02/16/23
-rw-r-----   3 flume flume    0 2014-02-16 17:04 /data/2014/02/16/23/_log.1392591607925.avro.tmp
-rw-r-----   3 flume flume 1877 2014-02-16 17:01 /data/2014/02/16/23/log.1392591607923.avro
-rw-r-----   3 flume flume 1955 2014-02-16 17:02 /data/2014/02/16/23/log.1392591607924.avro
-rw-r-----   3 flume flume 2390 2014-02-16 17:04 /data/2014/02/16/23/log.1392591798436.avro
• The zero-length .tmp file is the current file. You won’t see its real size until it closes (just like when you do a hadoop fs -put)
• Use …hdfs.inUsePrefix=_ to prevent open files from being included in MapReduce jobs
35. Event Serializers
• Defines how the Event gets written to Sink
• Just the body as a UTF-8 String
agent.sinks.foo-sink.serializer=text
• Headers and Body as a UTF-8 String
agent.sinks.foo-sink.serializer=header_and_text
• Avro (Flume event schema)
agent.sinks.foo-sink.serializer=avro_event
• Custom (none of the above meets your needs)
38. Timezones are Evil
• Daylight saving time causes problems twice a year (in spring there is no 2am hour; in fall you get twice the data during the 2am hour: 02:15? Which one?)
• Date processing in MapReduce jobs: hourly jobs, filters, etc.
• Dated paths: hdfs://NN/data/%Y/%m/%d/%H
• Use UTC: -Duser.timezone=UTC
• Use one of the ISO 8601 formats, like 2014-02-26T18:00:00.000Z
• Sorts the way you usually want
• Every time library supports it* (and if not, it’s easy to parse)
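One hedged way to set this for the agent JVM (conf/flume-env.sh is the standard hook; the exact line is an assumption):

# conf/flume-env.sh
export JAVA_OPTS="-Duser.timezone=UTC"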
39. Generally Speaking…
• Async handoff doesn’t work under load when bad stuff happens
[Diagram: Write → buffer (filesystem, or queue, or database, or whatever) → Read. The buffer is not ∞.]
48. Don’t Use Tail
• Tailing a file for input is bad: it relies on assumptions that aren’t guarantees
• Direct support was removed during the Flume rewrite
• Handoff can go bad with files: when the writer is faster than the reader
• With a queue: when the reader doesn’t read before the expire time
• No way to apply “back pressure” to tell tail there is a problem. It isn’t listening…
49. What can I use?
• If you can’t use the log4j Avro Appender…
• Use logrotate to move old logs to a “spool” directory
• SpoolingDirectorySource (sketch below)
• Finally, a cron job to remove .COMPLETED files (for delayed delete) OR set deletePolicy=immediate
• Alternatively, use logrotate with avro-client? (probably other ways too…)
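A hedged SpoolingDirectorySource sketch (the directory path is an assumption):

agent.sources.r1.type = spooldir
agent.sources.r1.spoolDir = /var/log/flume-spool
agent.sources.r1.channels = c1
# delete files as soon as they're fully ingested ("never" is the default)
agent.sources.r1.deletePolicy = immediate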
50. RAM or Disk Channels?
Source: http://blog.scoutapp.com/articles/2011/02/10/understanding-disk-i-o-when-should-you-be-worried
51. Duplicate Events
• Transactions only at Agent level
• You may see Events more than once
• Distributed Transactions are expensive
• Just deal with it in the query/scrub phase; much less costly than trying to prevent it from happening
52. Late Data
• Data could be “late”/delayed
• Outages
• Restarts
• Act of Nature
• The only sure thing is a “database”: single write + ACK
• Depending on your monitoring, it could be REALLY LATE.
53. Monitoring
• Know when it breaks so you can fix it before you can no longer ingest new data (and it is lost)
• This time window is small if volume is high
• Flume Monitoring still WIP, but hooks are there
54. Other Operational Concerns
• Resource utilization: number of open files when writing (file descriptors), disk space used for the file channel, disk contention, disk speed*
• Number of inbound and outbound sockets: may need to tier (Avro Source/Sink)
• Minimize hops if possible; each hop is another place for data to get stuck
55. Not everything is a nail
• Flume is great for handling individual records
• What if you need to compute an average?
• Get a Stream Processing system
• Storm (Twitter’s)
• Samza (LinkedIn’s)
• Others…
• Flume can co-exist with these; use the most appropriate tool