This document introduces Flume and Flive. Flume is a distributed data collection system that extends easily to new data formats and scales linearly as nodes are added. The document covers Flume's core concepts of events, flows, nodes, and reliability features, then introduces Flive, an enhanced version of Flume developed by Hanborq that provides improved performance, functionality, manageability, and integration with Hugetable.
3. The real world problem
• Changing requirements → Extensibility & Manageability
– In the source
– In the path
– In the sink
• Growing scale → Scalability
– Volume and node count keep increasing
• Error-prone operation → Reliability
– Network failures
– Service breakdowns
4. Flume: the solution to these problems
• Flume is:
– A distributed data collection system
– A streamlined event processing pipeline
– An extensible distributed computation framework
• Flume answers previous challenges
– Easily extends to new data formats
– Easily adapts to new collection strategies
– Scales linearly as new nodes are added
– Multiple levels of reliability
– Configurable from shell / web
– Etc.
5. Core Concepts: Flow and Event
• Everything is an event: body + metadata table
• A flow is an event pipeline from a particular data source
• Flows are comprised of nodes chained together
• Many flows may overlap on a physical cluster
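As a minimal sketch of these concepts in Flume's configuration syntax (node name, file path, and host are illustrative; 35853 is Flume's conventional collector port), a flow chains a node's source to its sink:

```
web-server : tail("/var/log/app.log") | agentSink("collector-host", 35853) ;
```

Each line appended to the tailed file becomes an event whose body plus metadata table travels through the flow.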
6. Core Concepts: Nodes and Plane
• Data plane:
– Path of the data flow
– Composed of one or more nodes in a tiered architecture
• Two-tier: Agent → Collector
• Multi-tier: Agent → Processor → Collector
• Nodes:
– Nodes have a source and a sink
– Their roles depend on their position in data path
• Masters are in the control plane
– Central control point
– Lightweight, since no data-plane processing is involved
7. Core Concepts: Agent and Collector
• Data plane nodes
– Agent
• Receives data from an application
– Processor (optional)
• Intermediate processing
– Collector
• Writes data to permanent storage
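The two-tier Agent → Collector split could be specified roughly as follows (host names, port, and paths are illustrative):

```
agent1 : tail("/var/log/httpd.log") | agentSink("collector1", 35853) ;
collector1 : collectorSource(35853) | collectorSink("hdfs://namenode/flume-logs/", "web-") ;
```

The agent's sink and the collector's source meet on the same port; the collector batches events into larger HDFS writes.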
8. Deploy Topology
• Deploy considerations
– Agents: depend on the application data source
– Collectors: depend on the target storage, network topology, load balancing, etc.
9. Considerations on Data Source
• Three integration modes:
– Push: the agent acts as a data-collecting service for the data source application
– Pull: the agent polls the data source periodically
– Embedded: the data source application is the agent itself
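The push and pull modes can be sketched as two agent configurations (names and ports illustrative): in push mode the application sends records to a port the agent listens on (here a syslog TCP source), while in pull mode the agent itself polls a file:

```
push-agent : syslogTcp(5140) | agentSink("collector1", 35853) ;
pull-agent : tail("/var/log/app.log") | agentSink("collector1", 35853) ;
```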
10. Data Plane Reliability
• Best effort
– Fire and forget
• Store on failure + retry
– Local acks; local errors are detectable
– Failover when faults are detected
• End-to-end reliability
– End-to-end acks
– Data survives compound failures
– At-least-once delivery
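Flume surfaces these three levels as agent sink variants: best effort (BE), disk failover / store-on-failure (DFO), and end-to-end acks (E2E). A sketch with illustrative node and host names:

```
be-agent  : tail("/var/log/app.log") | agentBESink("collector1", 35853) ;
dfo-agent : tail("/var/log/app.log") | agentDFOSink("collector1", 35853) ;
e2e-agent : tail("/var/log/app.log") | agentE2ESink("collector1", 35853) ;
```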
11. Control Plane Reliability
• Master design
– Lightweight process
• Isolated from data-plane processing
– Lazy design
• Simply answers a few node requests
• Service availability
– Watchdog
– Multiple backup masters
– Service availability across reboots
• Persists configuration data to ZooKeeper
12. Data Plane Scalability
• Data plane is horizontally scalable
– Add collectors to increase availability and to handle more data
• Assumes a single agent will not dominate a collector
• Fewer connections to HDFS.
• Larger more efficient writes to HDFS.
• Agents have mechanisms for machine resource tradeoffs
– Write logs locally to avoid collector disk-I/O bottlenecks and catastrophic failures
– Compression and batching (trade CPU for network)
– Push computation into the event collection pipeline (balance I/O, memory, and CPU resource bottlenecks)
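The batching and compression tradeoff can be expressed with decorators wrapped around a sink, along roughly these lines (batch size and names illustrative):

```
agent1 : tail("/var/log/app.log")
  | { batch(100) => { gzip => agentDFOSink("collector1", 35853) } } ;
```

Grouping 100 events per message and gzipping the batch trades agent CPU for network bandwidth.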
13. Data Plane Scalability
• Agents are logically partitioned and send to different
collectors
• Use randomization to pre-specify failovers when many collectors exist
– Spread load if a collector goes down.
– Spread load if new collectors are added to the system.
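Pre-specified failover can be written explicitly with Flume's failover sink syntax, or generated automatically (the autoBEChain/autoDFOChain/autoE2EChain sinks pick randomized failover chains from the registered collectors). An explicit sketch with illustrative names:

```
agent1 : tail("/var/log/app.log")
  | < agentDFOSink("collector1", 35853) ? agentDFOSink("collector2", 35853) > ;
```

Events go to collector1 while it is healthy and fail over to collector2 when it is not.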
14. Control Plane Scalability
• A master controls dynamic configurations of nodes
– Uses gossip protocol to keep state consistent
– Scales well for configuration reads
– Allows for adaptive repartitioning in the future
– Nodes can talk to any master.
15. Extensibility
• Extensibility answers changing use cases
– Invent new connectors
• Simple source/sink/decorator APIs
• Plug-in architecture
– Dynamically wired pipeline processing logic
• Many simple operations compose into complex behavior
• Connector
– Sources produce data: plain text files, directory, Log4j, FTP, SQL, …
– Sinks consume data: console, HDFS, local file system
– Decorators modify data sent to sinks
17. Manageability
• Near-natural-language node configuration
– web-log-agent : tail("/var/log/httpd.log") | agentBESink
– web-log-collector : autoCollectorSource
| { regex("(Firefox|Internet Explorer)", "browser") =>
collectorSink("hdfs://namenode/flume-logs/%{browser}") }
• One place to specify node sources, sinks and data flows
– Basic Web interface
– Flume Shell – command line interface
– Extended custom management through the master RPC API
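For example, the web-log-agent above could be wired from the Flume shell (host names hypothetical, prompt shown schematically):

```
flume shell
> connect master-host
> exec config web-log-agent 'tail("/var/log/httpd.log")' 'agentBESink("collector1", 35853)'
```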
18. Flive – HANBORQ Enhanced Flume
• Based on Flume, but oriented toward the HANBORQ product ecosystem
• The new HTLoad
• Enhancements:
– Performance
– Functionality
– Manageability
– Hugetable integration
• Compatible with original Flume usage
19. Flive – More Than Flume
• Efficiency improvements
– Driving the pipeline
• The native driver is a single thread doing both source-pulling and sink-pushing
– Temporal rate mismatches between source and sink affect each other
• Flive uses two threads, one source-pulling and one sink-pushing, coupled by an internal event queue
– Temporal rate variances in source and sink are absorbed by the queue
– Contributes a 10%–30% throughput improvement
– Introduced node concurrency to maximize target storage bandwidth
20. Flive – More Than Flume
• Functionality enhancements
– Native Flume's connector configuration syntax is flat
• But connectors are essentially hierarchical
• The limited flat syntax also forces connectors to be assembled flatly
• Connector hierarchies must be assembled through hard-coding or ad-hoc syntax
– Flive introduces a hierarchical syntax
• Hierarchical connector architectures can be dynamically wired
• For backward compatibility, only Flive connectors support the enhanced syntax
21. Flive – More Than Flume
• Ease of use
– Zero-configuration plug-in architecture
• Native Flume requires manual configuration of plugins
• Flive requires no configuration beyond minimal conventions
– A simpler yet powerful Flive shell
– Introduced a translator framework
• Node configuration specs may be too complicated to edit manually
• Translators turn user-domain specs into Flive/Flume configuration specs
• Extensible
– Hugetable translator for Hugetable
– Basic translator for native Flume: full Flume compatibility
– Eases deployment and management
22. Flive – More Than Flume
• As a Hugetable ETL
– Sources structured data from various origins
• FS, FTP, SQL, Log4j, …
– Targets all Hugetable storage engines
• Text File, Sequence File, RCFile, HFile, HBase, …
– Filters unwanted/malformed records
– Column transformation over the air
• IUD-like single-stream column ops, based on function expressions
• Multi-stream ops: pre-join on the fly
– Multi-table loading
• Like fan-out, but with less overhead
– Real-time aggregation
• Accurate computation: sum(x), count(*)
• Probabilistic computation: count(distinct x), top(k), etc.