Flume Logging for the Enterprise. Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer. Cloudera, Inc. Chicago Data Summit, 4/26/11.
Who Am I? Cloudera: Software Engineer on the Platform Team; Flume Project Lead / Designer / Architect. U of Washington: “On Leave” from the PhD program; research in Systems and Programming Languages. Previously: Computer Security, Embedded Systems.
An Enterprise Scenario. You have a bunch of departments with servers generating log files. You are required to keep logs and want to analyze and profit from them. Because of the volume of uncooked data, you’ve started using Cloudera’s Distribution including Apache Hadoop… and you’ve got several ad-hoc, legacy scripts/systems that copy data from servers/filers and then to HDFS. It’s log, log… everyone wants a log!
Ad-hoc gets complicated. Black box? What happens if the person who wrote it leaves? Unextensible? Is it one-off, or flexible enough to handle future needs? Unmanageable? Do you know when something goes wrong? Unreliable? If something goes wrong, will it recover? Unscalable? Hit an ingestion rate limit?
Cloudera Flume. Flume is a framework and conduit for collecting and quickly shipping data records from many sources to one centralized place for storage and processing. Project Goals: Scalability, Reliability, Extensibility, Manageability, Openness.
The Canonical Use Case. [diagram] Servers in the agent tier each run a Flume agent; agents send log events to a collector tier; collectors aggregate the streams and write to HDFS. A Flume Master configures and controls all of the nodes.
Flume’s Key Abstractions. Data path and control path. Nodes are in the data path: each node has a source and a sink, and nodes can take different roles. A typical topology has agent nodes and collector nodes; optionally it has processor nodes. Masters are in the control path: a centralized point of configuration that specifies sources and sinks and can control flows of data between nodes. Use one master, or use many with a ZooKeeper-backed quorum.
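A minimal sketch of how these abstractions map onto configuration, assuming the Flume OG sink/source names and the default 35853 collector port (the host names are illustrative):

  agent1     : tail("/var/log/app/log") | agentSink("collector1", 35853) ;
  collector1 : collectorSource(35853) | collectorSink("hdfs://namenode/logs", "app") ;

Each line binds a logical node to a (source, sink) pair; the master pushes these mappings out to the physical machines.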
Outline. What is Flume? Scalability: horizontal scalability of all nodes and masters. Reliability: fault tolerance and high availability. Extensibility: Unix principle; all kinds of data, all kinds of sources, all kinds of sinks. Manageability: centralized management supporting dynamic reconfiguration. Openness: Apache v2.0 license and an active and growing community.
Scalability
The Canonical Use Case (recap). [diagram] Agents on servers fan in to collectors; collectors write to HDFS.
Data path is horizontally scalable. Add collectors to increase availability and to handle more data (this assumes a single agent will not dominate a collector). Fewer connections to HDFS, which would otherwise tax the resource-constrained NameNode. Larger, more efficient writes to HDFS; fewer files avoids the “small file problem”. Simplifies the security story when supporting Kerberized HDFS or protected production servers. On the agent side: write the log locally to avoid a collector disk I/O bottleneck and catastrophic failures; compression and batching (trade CPU for network); push computation into the event collection pipeline (balance I/O, memory, and CPU resource bottlenecks).
Node scalability limits and optimization plans. In most deployments today, a single collector is not saturated. The current implementation can write at 20 MB/s over 1GbE (~1.75 TB/day) due to unoptimized network usage. Assuming 1GbE with aggregate disk able to write at close to the GbE rate, we can probably reach: 3-5x by batching to get to the wire/disk limit (trade latency for throughput); 5-10x by compression to trade CPU for throughput (logs are highly compressible). The limit is probably in the ballpark of 40 effective TB/day/collector.
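A hedged sketch of layering those optimizations into a flow using the batch() and gzip sink decorators described in the Flume OG guide (the decorator wrapping syntax is Flume OG’s; the batch size of 100 events and host name are illustrative):

  agent1 : tail("/var/log/app/log") | { batch(100) => { gzip => agentE2ESink("collector1", 35853) } } ;

Batching amortizes per-event network overhead, and gzip then compresses each batch before it crosses the wire, trading agent CPU for throughput as described above.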
Control plane is horizontally scalable. A master controls dynamic configurations of nodes. Uses a consensus protocol to keep state consistent. Scales well for configuration reads. Allows for adaptive repartitioning in the future. Nodes can talk to any master. Masters can talk to an existing ZooKeeper ensemble.
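A minimal sketch of pointing a deployment at multiple masters, assuming the flume.master.servers property from the Flume OG site configuration (host names illustrative):

  <property>
    <name>flume.master.servers</name>
    <value>masterA,masterB,masterC</value>
  </property>

With more than one master listed, the masters coordinate through the ZooKeeper-backed store and a node can heartbeat to any of them.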
Reliability
Failures. Faults can happen at many levels: software applications can fail; machines can fail; networking gear can fail; excessive network congestion or machine load; a node goes down for maintenance. How do we make sure that events make it to a permanent store?
Tunable failure recovery modes. Best effort: fire and forget. Store on failure + retry: writes to disk on detected failure; one-hop TCP acks; failover when faults are detected. End-to-end reliability: write-ahead log on the agent; checksums and end-to-end acks; data survives compound failures, and may be retried multiple times.
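These three modes surface in the configuration language as three agent sinks; a sketch assuming the Flume OG names agentBESink (best effort), agentDFOSink (disk failover, i.e. store on failure + retry), and agentE2ESink (end-to-end), with illustrative hosts:

  nodeA : tail("/var/log/app/log") | agentBESink("collector1", 35853) ;
  nodeB : tail("/var/log/app/log") | agentDFOSink("collector1", 35853) ;
  nodeC : tail("/var/log/app/log") | agentE2ESink("collector1", 35853) ;

Moving between reliability levels is a one-word configuration change rather than a code change.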
Load balancing and collector failover. Agents can spread events across multiple collectors. Use randomization to pre-specify failovers when many collectors exist: spread load if a collector goes down; spread load if new collectors are added to the system.
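A hedged sketch of a manually specified failover chain, using the < primary ? backup > failover syntax from the Flume OG guide (collector names illustrative); in practice the auto*Chain sinks let the master compute such randomized chains automatically:

  agent1 : tail("/var/log/app/log") | < agentE2ESink("collectorA", 35853) ? agentE2ESink("collectorB", 35853) > ;

If collectorA becomes unreachable, the agent fails over to collectorB; with master-computed chains the assignments are randomized, so a failed collector’s load spreads across the survivors.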
Control plane is fault tolerant. A master controls dynamic configurations of nodes. Uses a consensus protocol to keep state consistent. Scales well for configuration reads. Allows for adaptive repartitioning in the future. Nodes can talk to any master, so if one master fails, nodes fail over to the remaining masters. Masters can talk to an existing ZooKeeper ensemble.
Extensibility
Flume is easy to extend. Simple source and sink APIs. An event streaming design. Many simple operations compose for complex behavior. Plug-in architecture, so you can add your own sources, sinks, and decorators.
Variety of Connectors. Sources produce data: Console, Exec, Syslog, Scribe, IRC, Twitter; in the works: JMS, AMQP, pubsubhubbub/RSS/Atom. Sinks consume data: Console, local files, HDFS, S3; contributed: Hive (Mozilla), HBase (Sematext), Cassandra (Riptano/DataStax), Voldemort, Elastic Search; in the works: JMS, AMQP. Decorators modify data sent to sinks: wire batching, compression, sampling, projection, extraction, throughput throttling; custom near real-time processing (Meebo); JRuby event modifiers (InfoChimps); cryptographic extensions (Rearden); streaming SQL in-stream-analytics system FlumeBase (Aaron Kimball).
Migrating a previous enterprise architecture. [diagram] Existing filers, message buses, and custom apps feed Flume agents through poller, AMQP, and Avro sources; Flume collectors then deliver everything to HDFS.
Data ingestion pipeline pattern. [diagram] Agents on servers feed a collector whose fanout sink writes to HDFS, HBase, and an incremental search index, serving Hive and Pig queries, key lookups, range queries, and search/faceted queries.
Manageability. Wheeeeee!
Configuring Flume. Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ; A concise and precise configuration language for specifying dataflows in a node. Dynamic updates of configurations: allows for live failover changes; allows for handling newly provisioned machines; allows for changing analytics.
Output bucketing. Automatic output file management: write HDFS files into buckets based on time tags.
Collector node : collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")
/logs/web/2010/0715/1200/data-xxx.txt
/logs/web/2010/0715/1200/data-xxy.txt
/logs/web/2010/0715/1300/data-xxx.txt
/logs/web/2010/0715/1300/data-xxy.txt
/logs/web/2010/0715/1400/data-xxx.txt
…
Configuration is straightforward.
node001: tail("/var/log/app/log") | autoE2ESink;
node002: tail("/var/log/app/log") | autoE2ESink;
…
node100: tail("/var/log/app/log") | autoE2ESink;
collector1: autoCollectorSource | collectorSink("hdfs://logs/app/","applogs")
collector2: autoCollectorSource | collectorSink("hdfs://logs/app/","applogs")
collector3: autoCollectorSource | collectorSink("hdfs://logs/app/","applogs")
Centralized Dataflow Management Interfaces. One place to specify node sources, sinks, and data flows. Basic web interface. Flume Shell: command line interface, scriptable. Cloudera Enterprise Flume Monitor App: graphical web interface.
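A hedged sketch of scripting a live reconfiguration through the Flume shell, assuming the flume shell command and its exec config verb from the Flume OG docs (master host, admin port, and node name illustrative):

  $ flume shell -c master:35873
  [flume master:35873] exec config node001 'tail("/var/log/app/log")' 'autoE2ESink'

The master then pushes the new source/sink mapping to node001 without restarting the node, which is what makes the live failover and re-provisioning changes above practical.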
Enterprise Friendly. Integrated as part of CDH3 and Cloudera Enterprise: RPM and DEB packaging for enterprise Linux; Flume Node for Windows (beta). Cloudera Enterprise Support: 24-7 support, SLAs, professional services. Cloudera Flume features for enterprises: Kerberos authentication support for writing to “secure” HDFS; detailed JSON-exposed metrics for monitoring integration (beta); Log4J collection (beta); high availability via multiple masters (alpha); encrypted SSL/TLS data path and control path support (dev).
An enterprise story. [diagram] Windows and Linux department servers run agents that feed a collector tier, which writes to Kerberized HDFS; Active Directory / LDAP backs authentication.
Openness and Community
Flume is Open Source. Apache v2.0 open source license, independent from the Apache Software Foundation: you have the right to fork or modify the software. GitHub source code repository: http://github.com/cloudera/flume. Regular tarball update versions every 2-3 months. Regular CDH packaging updates every 3-4 months. Always looking for contributors and committers.
Growing user and developer community. Lots of innovation comes from the community. Community folks are willing to try incomplete features. Early feedback and community fixes. Many interesting topologies in the community.
[logo] : Multi Datacenter. [diagram] In each datacenter, agents on API servers and processor servers feed a local collector tier; a relay forwards events between datacenters on the way to HDFS.
[logo] : Near Real-time Aggregator. [diagram] Agents on ad servers feed a collector; a tracker produces quick reports into a DB, while a Hive job over the HDFS copy verifies the reports.
Community Support. Community-based mailing lists for support (“an answer in a few days”): User: https://groups.google.com/a/cloudera.org/group/flume-user ; Dev: https://groups.google.com/a/cloudera.org/group/flume-dev . Community-based IRC chat room (“quick questions, quick answers”): #flume on irc.freenode.net.
Conclusions
Summary. Flume is a distributed, reliable, scalable, extensible system for collecting and delivering high-volume continuous event data such as logs. It is centrally managed, which allows for automated and adaptive configurations; this design allows for near-real-time processing. Apache v2.0 license with an active and growing community. Part of Cloudera’s Distribution including Apache Hadoop, updated for CDH3u0, and Cloudera Enterprise. Several CDH users in the community are in production use; several Cloudera Enterprise customers are evaluating it for production use.
Related systems. Remote syslog-ng / rsyslog / syslog: best effort; if the server is down, messages are lost. Chukwa (Yahoo! / Apache Incubator): designed as a monitoring system for Hadoop; minibatches require MapReduce batch processing to demultiplex data; new HBase-dependent path; one of the core contributors (Ari) currently works at Cloudera (not on Chukwa). Scribe (Facebook): only durable-on-failure reliability mechanisms; collector disk is the bottleneck; little visibility into system performance; little support or documentation; most Scribe deploys replaced by “Data Freeway”. Kafka (LinkedIn): new system by LinkedIn; pull model; interesting, written in Scala.
Questions? Contact info: jon@cloudera.com, Twitter @jmhsieh
Flow Isolation. Isolate different kinds of data when and where it is generated. Have multiple logical nodes on a machine; each has its own data source and its own data sink. [diagram]
Image credits:
http://www.flickr.com/photos/victorvonsalza/3327750057/
http://www.flickr.com/photos/victorvonsalza/3207639929/
http://www.flickr.com/photos/victorvonsalza/3327750059/
http://www.emvergeoning.com/?m=200811
http://www.flickr.com/photos/juse/188960076/
http://www.flickr.com/photos/23720661@N08/3186507302/
http://clarksoutdoorchairs.com/log_adirondack_chairs.html
http://www.flickr.com/photos/dboo/3314299591/
Master Service Failures. A master machine should not be a single point of failure! Masters keep two kinds of information. Configuration information (node/flow configuration): kept in a ZooKeeper ensemble for a persistent, highly available metadata store; failures are easily recovered from. Ephemeral information (heartbeat info, acks, metrics reports): kept in memory; failures will lose this data, but it can be lazily replicated.
Dealing with Agent failures. We do not want to lose data, so make events durable at the generation point. If a log generator goes down, it is not generating logs; if the event generation point fails and recovers, data will reach the end point. Data is durable and survives machine crashes and reboots. This allows for synchronous writes in log-generating applications. A watchdog program restarts the agent if it fails.