Apache Flume HBase Integration
- 1. Buzzwords Berlin HBase Hackathon, June 2012
Apache Flume and HBase
Alexander Alten-Lorenz | Customer Operations Engineer
- 2. About Me
• COPS Engineer @ Cloudera
• Apache Flume Contributor
• Working with Hadoop since 2009
• Blogger (mapredit.blogspot.com)
• Speaker at Conferences / Meetups / Tooling Events
©2012 Cloudera, Inc. All Rights Reserved.
- 3. Flume 1.x
• Mass event collector
• Stream data (events, not files) from clients to sinks
• Clients: files, syslog, avro, seq, exec
• Sinks: HDFS files, HBase, …
• Configurable routing / topology
- 4. Architecture
Component        Function
Agent            The JVM running Flume. One per machine. Runs many sources and sinks.
Client (source)  Produces data in the form of events. Runs in a separate thread.
Sink             Receives events from a channel. Runs in a separate thread.
Channel          Connects sources to sinks (like a queue). Implements the reliability semantics.
Event            A single datum: a log record, an Avro object, etc. Normally around 4 KB.
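
The roles in the table above can be illustrated with a small model. This is a minimal sketch in plain Python, not Flume code: the names `source`, `sink`, and `run_agent` are hypothetical, a bounded `queue.Queue` stands in for the channel, and a `None` sentinel (an assumption of this sketch) marks the end of input.

```python
import queue
import threading


def source(channel, records):
    """Wrap each record in an event (headers + body) and put it on the channel."""
    for record in records:
        channel.put({"headers": {}, "body": record.encode("utf-8")})
    channel.put(None)  # sentinel: tells the sink that the source is done


def sink(channel, delivered):
    """Drain events from the channel until the sentinel arrives."""
    while True:
        event = channel.get()
        if event is None:
            break
        delivered.append(event["body"].decode("utf-8"))


def run_agent(records):
    """One 'agent': a source thread and a sink thread joined by a channel."""
    channel = queue.Queue(maxsize=100)  # bounded, like a capacity-limited channel
    delivered = []
    threads = [
        threading.Thread(target=source, args=(channel, records)),
        threading.Thread(target=sink, args=(channel, delivered)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return delivered
```

The source and the sink never call each other; the channel is the only thing they share, which is why the reliability semantics live there.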
- 5. Agent
• Runs many clients and sinks
• Java properties-based configuration
• Low overhead (-Xmx20m)
– adding RAM increases performance
– setting -Xms avoids on-demand heap growth at runtime
– batching increases performance dramatically
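
One way to see why batching helps: every transaction carries a fixed cost (a commit, typically a round trip to the store), and a larger batch spreads that cost over more events. The sketch below only counts commits; `drain` is a hypothetical helper, not part of Flume, and it measures nothing about real throughput.

```python
def drain(events, batch_size):
    """Deliver events in batches; return (delivered, commits)."""
    delivered, commits = [], 0
    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]
        delivered.extend(batch)  # stands in for one write per batch
        commits += 1             # one transaction commit per batch
    return delivered, commits


events = list(range(1000))
_, commits_unbatched = drain(events, batch_size=1)
_, commits_batched = drain(events, batch_size=100)
```

With these inputs the unbatched run commits once per event, while the batched run commits once per hundred events.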
- 6. Sources
• Plugin interface
• Managed by a SourceRunner that controls threading and execution model (e.g. polling vs. event-based)
• Included: exec, avro, syslog, seq
- 7. HBase sink
ls -la flume-ng-sinks/flume-ng-hbase-sink/src/main/java/org/apache/flume/sink/hbase/
HBaseSink.java
HbaseEventSerializer.java
SimpleHbaseEventSerializer.java
SimpleRowKeyGenerator.java
- 8. HBaseSink.java
• Controls flush()
• Uses the serializer
• Controls the transaction
• Controls rollbacks (when events could not be written)
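
The transaction and rollback responsibilities listed above can be sketched as a take/commit/rollback cycle. This is a conceptual model in plain Python, not the code in HBaseSink.java: the `Channel` class, the `process` function, and the `write_batch` callback are all hypothetical names chosen for illustration.

```python
from collections import deque


class Channel:
    """Hypothetical in-memory channel with take/commit/rollback semantics."""

    def __init__(self, events):
        self._queue = deque(events)
        self._taken = []  # events handed out but not yet committed

    def take(self):
        if not self._queue:
            return None
        event = self._queue.popleft()
        self._taken.append(event)
        return event

    def commit(self):
        self._taken.clear()

    def rollback(self):
        # Return uncommitted events to the head of the queue in original order.
        self._queue.extendleft(reversed(self._taken))
        self._taken.clear()

    def __len__(self):
        return len(self._queue)


def process(channel, write_batch, batch_size=100):
    """Take up to batch_size events, write them, then commit or roll back."""
    batch = []
    while len(batch) < batch_size:
        event = channel.take()
        if event is None:
            break
        batch.append(event)
    try:
        if batch:
            write_batch(batch)  # stands in for serialize + flush to the store
        channel.commit()
        return True
    except Exception:
        channel.rollback()      # nothing is lost; the batch can be retried
        return False
```

If `write_batch` raises, the events go back to the channel, so a later call to `process` sees them again in the same order.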
- 9. Configuration
• Source Seq interface
• Listening on a defined port @localhost
• Serializer needs some parameters
• Column family and column must be known
• Valid hbase-site.xml in $CLASSPATH
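
To make the serializer's job concrete, the sketch below maps one event body to a list of HBase-style actions. It is a rough model in plain Python, not the Flume serializer interface: the function name `serialize`, the dict layout, the UUID row key, and the `incRow` counter row are assumptions made for illustration.

```python
import uuid


def serialize(body, column_family, payload_column=None, increment_column=None,
              row_key=None):
    """Turn one event body into a list of HBase-style actions (plain dicts)."""
    actions = []
    if payload_column is not None:
        actions.append({
            "op": "put",
            "row": row_key or uuid.uuid4().hex,  # assumed row-key scheme
            "family": column_family,
            "qualifier": payload_column,
            "value": body,
        })
    if increment_column is not None:
        actions.append({
            "op": "increment",
            "row": "incRow",                     # hypothetical counter row
            "family": column_family,
            "qualifier": increment_column,
            "amount": 1,
        })
    return actions


actions = serialize(b"hello", "testing", "pcol", "icol", row_key="r1")
```

The column family is required in every action, which is why it has to be known (and exist in the table) before the sink starts.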
- 10. Configuration Example
host1.sources = src1
host1.sinks = sink1
host1.channels = ch1
host1.sources.src1.type = seq
host1.sources.src1.port = 25001
host1.sources.src1.bind = localhost
host1.sources.src1.channels = ch1
host1.sinks.sink1.type = org.apache.flume.sink.hbase.HBaseSink
host1.sinks.sink1.channel = ch1
host1.sinks.sink1.table = test3
host1.sinks.sink1.columnFamily = testing
host1.sinks.sink1.column = foo
host1.sinks.sink1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
host1.sinks.sink1.serializer.payloadColumn = pcol
host1.sinks.sink1.serializer.incrementColumn = icol
host1.channels.ch1.type=memory
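
A quick sanity check of such a file is to confirm that every source and sink points at a declared channel. The sketch below is a hypothetical stand-alone checker in plain Python, not a Flume tool; `parse_properties` handles only simple `key = value` lines, and `CONFIG` repeats a subset of the example above.

```python
def parse_properties(text):
    """Parse 'key = value' lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of problems found in the source/channel/sink wiring."""
    channels = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for src in props.get(f"{agent}.sources", "").split():
        for ch in props.get(f"{agent}.sources.{src}.channels", "").split():
            if ch not in channels:
                problems.append(f"source {src}: unknown channel {ch}")
    for snk in props.get(f"{agent}.sinks", "").split():
        ch = props.get(f"{agent}.sinks.{snk}.channel")
        if ch not in channels:
            problems.append(f"sink {snk}: unknown channel {ch}")
    return problems


CONFIG = """
host1.sources = src1
host1.sinks = sink1
host1.channels = ch1
host1.sources.src1.type = seq
host1.sources.src1.channels = ch1
host1.sinks.sink1.type = org.apache.flume.sink.hbase.HBaseSink
host1.sinks.sink1.channel = ch1
host1.channels.ch1.type=memory
"""

problems = check_wiring(parse_properties(CONFIG), "host1")
```

Note the asymmetry the checker relies on: a source lists its channels under `channels` (plural), while a sink names exactly one under `channel` (singular).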
- 11. Take Away
• Flume collects events
• Source - Channel - Sink concept
• HBase sink needs a serializer interface
• Column family and column must be known
- 12. Thank You
• Web: https://cwiki.apache.org/FLUME/getting-started.html
• ML: flume-user@incubator.apache.org
• Mail: alexander@cloudera.com
• Blog: mapredit.blogspot.com
• Twitter: @mapredit