Flume Logging for the Enterprise. Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer. Cloudera, Inc. Chicago Data Summit, 4/26/11.
Who Am I? Cloudera: Software Engineer on the Platform Team; Flume Project Lead / Designer / Architect. U of Washington: “On Leave” from the PhD program; research in Systems and Programming Languages. Previously: Computer Security, Embedded Systems.
An Enterprise Scenario. You have a bunch of departments with servers generating log files. You are required to keep logs and want to analyze and profit from them. Because of the volume of uncooked data, you’ve started using Cloudera’s Distribution including Apache Hadoop… and you’ve got several ad-hoc, legacy scripts/systems that copy data from servers/filers and then to HDFS. It’s log, log… everyone wants a log!
Ad-hoc gets complicated. Black box? What happens if the person who wrote it leaves? Unextensible? Is it one-off, or flexible enough to handle future needs? Unmanageable? Do you know when something goes wrong? Unreliable? If something goes wrong, will it recover? Unscalable? Hit an ingestion rate limit?
Cloudera Flume. Flume is a framework and conduit for collecting and quickly shipping data records from many sources to one centralized place for storage and processing. Project Goals: Scalability, Reliability, Extensibility, Manageability, Openness.
The Canonical Use Case. [diagram] Servers in the agent tier each run a Flume agent; agents send log events to a collector tier; collectors aggregate the streams and write to HDFS. A Flume Master configures and controls all of the nodes.
Flume’s Key Abstractions. Data path and control path. Nodes are in the data path: each node has a source and a sink, and nodes can take different roles. A typical topology has agent nodes and collector nodes; optionally it has processor nodes. Masters are in the control path: a centralized point of configuration that specifies sources and sinks and can control flows of data between nodes. Use one master, or use many with a ZooKeeper-backed quorum.
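A minimal sketch of how these abstractions map onto configuration, assuming the Flume OG sink/source names and the default 35853 collector port (the host names are illustrative):

  agent1     : tail("/var/log/app/log") | agentSink("collector1", 35853) ;
  collector1 : collectorSource(35853) | collectorSink("hdfs://namenode/logs", "app") ;

Each line binds a logical node to a (source, sink) pair; the master pushes these mappings out to the physical machines.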
Outline. What is Flume? Scalability: horizontal scalability of all nodes and masters. Reliability: fault tolerance and high availability. Extensibility: Unix principle; all kinds of data, all kinds of sources, all kinds of sinks. Manageability: centralized management supporting dynamic reconfiguration. Openness: Apache v2.0 license and an active and growing community.
Scalability
The Canonical Use Case (recap). [diagram] Agents on servers fan in to collectors; collectors write to HDFS.
Data path is horizontally scalable. Add collectors to increase availability and to handle more data (this assumes a single agent will not dominate a collector). Fewer connections to HDFS, which would otherwise tax the resource-constrained NameNode. Larger, more efficient writes to HDFS; fewer files avoids the “small file problem”. Simplifies the security story when supporting Kerberized HDFS or protected production servers. On the agent side: write the log locally to avoid a collector disk I/O bottleneck and catastrophic failures; compression and batching (trade CPU for network); push computation into the event collection pipeline (balance I/O, memory, and CPU resource bottlenecks).
Node scalability limits and optimization plans. In most deployments today, a single collector is not saturated. The current implementation can write at 20 MB/s over 1GbE (~1.75 TB/day) due to unoptimized network usage. Assuming 1GbE with aggregate disk able to write at close to the GbE rate, we can probably reach: 3-5x by batching to get to the wire/disk limit (trade latency for throughput); 5-10x by compression to trade CPU for throughput (logs are highly compressible). The limit is probably in the ballpark of 40 effective TB/day/collector.
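A hedged sketch of layering those optimizations into a flow using the batch() and gzip sink decorators described in the Flume OG guide (the decorator wrapping syntax is Flume OG’s; the batch size of 100 events and host name are illustrative):

  agent1 : tail("/var/log/app/log") | { batch(100) => { gzip => agentE2ESink("collector1", 35853) } } ;

Batching amortizes per-event network overhead, and gzip then compresses each batch before it crosses the wire, trading agent CPU for throughput as described above.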
Control plane is horizontally scalable. A master controls dynamic configurations of nodes. Uses a consensus protocol to keep state consistent. Scales well for configuration reads. Allows for adaptive repartitioning in the future. Nodes can talk to any master. Masters can talk to an existing ZooKeeper ensemble.
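A minimal sketch of pointing a deployment at multiple masters, assuming the flume.master.servers property from the Flume OG site configuration (host names illustrative):

  <property>
    <name>flume.master.servers</name>
    <value>masterA,masterB,masterC</value>
  </property>

With more than one master listed, the masters coordinate through the ZooKeeper-backed store and a node can heartbeat to any of them.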
Reliability
Failures. Faults can happen at many levels: software applications can fail; machines can fail; networking gear can fail; excessive network congestion or machine load; a node goes down for maintenance. How do we make sure that events make it to a permanent store?
Tunable failure recovery modes. Best effort: fire and forget. Store on failure + retry: writes to disk on detected failure; one-hop TCP acks; failover when faults are detected. End-to-end reliability: write-ahead log on the agent; checksums and end-to-end acks; data survives compound failures, and may be retried multiple times.
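These three modes surface in the configuration language as three agent sinks; a sketch assuming the Flume OG names agentBESink (best effort), agentDFOSink (disk failover, i.e. store on failure + retry), and agentE2ESink (end-to-end), with illustrative hosts:

  nodeA : tail("/var/log/app/log") | agentBESink("collector1", 35853) ;
  nodeB : tail("/var/log/app/log") | agentDFOSink("collector1", 35853) ;
  nodeC : tail("/var/log/app/log") | agentE2ESink("collector1", 35853) ;

Moving between reliability levels is a one-word configuration change rather than a code change.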
Load balancing and collector failover. Agents can spread events across multiple collectors. Use randomization to pre-specify failovers when many collectors exist: spread load if a collector goes down; spread load if new collectors are added to the system.
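A hedged sketch of a manually specified failover chain, using the < primary ? backup > failover syntax from the Flume OG guide (collector names illustrative); in practice the auto*Chain sinks let the master compute such randomized chains automatically:

  agent1 : tail("/var/log/app/log") | < agentE2ESink("collectorA", 35853) ? agentE2ESink("collectorB", 35853) > ;

If collectorA becomes unreachable, the agent fails over to collectorB; with master-computed chains the assignments are randomized, so a failed collector’s load spreads across the survivors.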
Control plane is fault tolerant. A master controls dynamic configurations of nodes. Uses a consensus protocol to keep state consistent. Scales well for configuration reads. Allows for adaptive repartitioning in the future. Nodes can talk to any master, so if one master fails, nodes fail over to the remaining masters. Masters can talk to an existing ZooKeeper ensemble.
Extensibility
Flume is easy to extend. Simple source and sink APIs. An event streaming design. Many simple operations compose for complex behavior. Plug-in architecture, so you can add your own sources, sinks, and decorators.
Variety of Connectors. Sources produce data: Console, Exec, Syslog, Scribe, IRC, Twitter; in the works: JMS, AMQP, pubsubhubbub/RSS/Atom. Sinks consume data: Console, local files, HDFS, S3; contributed: Hive (Mozilla), HBase (Sematext), Cassandra (Riptano/DataStax), Voldemort, Elastic Search; in the works: JMS, AMQP. Decorators modify data sent to sinks: wire batching, compression, sampling, projection, extraction, throughput throttling; custom near real-time processing (Meebo); JRuby event modifiers (InfoChimps); cryptographic extensions (Rearden); streaming SQL in-stream-analytics system FlumeBase (Aaron Kimball).
Migrating a previous enterprise architecture. [diagram] Existing filers, message buses, and custom apps feed Flume agents through poller, AMQP, and Avro sources; Flume collectors then deliver everything to HDFS.
Data ingestion pipeline pattern. [diagram] Agents on servers feed a collector whose fanout sink writes to HDFS, HBase, and an incremental search index, serving Hive and Pig queries, key lookups, range queries, and search/faceted queries.
Manageability. Wheeeeee!
Configuring Flume. Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ; A concise and precise configuration language for specifying dataflows in a node. Dynamic updates of configurations: allows for live failover changes; allows for handling newly provisioned machines; allows for changing analytics.
Output bucketing. Automatic output file management: write HDFS files into buckets based on time tags.
Collector node : collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")
/logs/web/2010/0715/1200/data-xxx.txt
/logs/web/2010/0715/1200/data-xxy.txt
/logs/web/2010/0715/1300/data-xxx.txt
/logs/web/2010/0715/1300/data-xxy.txt
/logs/web/2010/0715/1400/data-xxx.txt
…
Configuration is straightforward.
node001: tail("/var/log/app/log") | autoE2ESink;
node002: tail("/var/log/app/log") | autoE2ESink;
…
node100: tail("/var/log/app/log") | autoE2ESink;
collector1: autoCollectorSource | collectorSink("hdfs://logs/app/","applogs")
collector2: autoCollectorSource | collectorSink("hdfs://logs/app/","applogs")
collector3: autoCollectorSource | collectorSink("hdfs://logs/app/","applogs")
Centralized Dataflow Management Interfaces. One place to specify node sources, sinks, and data flows. Basic web interface. Flume Shell: command line interface, scriptable. Cloudera Enterprise Flume Monitor App: graphical web interface.
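A hedged sketch of scripting a live reconfiguration through the Flume shell, assuming the flume shell command and its exec config verb from the Flume OG docs (master host, admin port, and node name illustrative):

  $ flume shell -c master:35873
  [flume master:35873] exec config node001 'tail("/var/log/app/log")' 'autoE2ESink'

The master then pushes the new source/sink mapping to node001 without restarting the node, which is what makes the live failover and re-provisioning changes above practical.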
Enterprise Friendly. Integrated as part of CDH3 and Cloudera Enterprise: RPM and DEB packaging for enterprise Linux; Flume Node for Windows (beta). Cloudera Enterprise Support: 24-7 support, SLAs, professional services. Cloudera Flume features for enterprises: Kerberos authentication support for writing to “secure” HDFS; detailed JSON-exposed metrics for monitoring integration (beta); Log4J collection (beta); high availability via multiple masters (alpha); encrypted SSL/TLS data path and control path support (dev).
An enterprise story. [diagram] Windows and Linux department servers run agents that feed a collector tier, which writes to Kerberized HDFS; Active Directory / LDAP backs authentication.
Openness and Community
Flume is Open Source. Apache v2.0 open source license, independent from the Apache Software Foundation: you have the right to fork or modify the software. GitHub source code repository: http://github.com/cloudera/flume. Regular tarball update versions every 2-3 months. Regular CDH packaging updates every 3-4 months. Always looking for contributors and committers.
Growing user and developer community. Lots of innovation comes from the community. Community folks are willing to try incomplete features. Early feedback and community fixes. Many interesting topologies in the community.
[logo] : Multi Datacenter. [diagram] In each datacenter, agents on API servers and processor servers feed a local collector tier; a relay forwards events between datacenters on the way to HDFS.
[logo] : Near Real-time Aggregator. [diagram] Agents on ad servers feed a collector; a tracker produces quick reports into a DB, while a Hive job over the HDFS copy verifies the reports.
Community Support. Community-based mailing lists for support (“an answer in a few days”): User: https://groups.google.com/a/cloudera.org/group/flume-user ; Dev: https://groups.google.com/a/cloudera.org/group/flume-dev . Community-based IRC chat room (“quick questions, quick answers”): #flume on irc.freenode.net.
Conclusions
Summary. Flume is a distributed, reliable, scalable, extensible system for collecting and delivering high-volume continuous event data such as logs. It is centrally managed, which allows for automated and adaptive configurations; this design allows for near-real-time processing. Apache v2.0 license with an active and growing community. Part of Cloudera’s Distribution including Apache Hadoop, updated for CDH3u0, and Cloudera Enterprise. Several CDH users in the community are in production use; several Cloudera Enterprise customers are evaluating it for production use.
Related systems. Remote syslog-ng / rsyslog / syslog: best effort; if the server is down, messages are lost. Chukwa (Yahoo! / Apache Incubator): designed as a monitoring system for Hadoop; minibatches require MapReduce batch processing to demultiplex data; new HBase-dependent path; one of the core contributors (Ari) currently works at Cloudera (not on Chukwa). Scribe (Facebook): only durable-on-failure reliability mechanisms; collector disk is the bottleneck; little visibility into system performance; little support or documentation; most Scribe deploys replaced by “Data Freeway”. Kafka (LinkedIn): new system by LinkedIn; pull model; interesting, written in Scala.
Questions? Contact info: jon@cloudera.com, Twitter @jmhsieh
Flow Isolation. Isolate different kinds of data when and where it is generated. Have multiple logical nodes on a machine; each has its own data source and its own data sink. [diagram]
Image credits:
http://www.flickr.com/photos/victorvonsalza/3327750057/
http://www.flickr.com/photos/victorvonsalza/3207639929/
http://www.flickr.com/photos/victorvonsalza/3327750059/
http://www.emvergeoning.com/?m=200811
http://www.flickr.com/photos/juse/188960076/
http://www.flickr.com/photos/23720661@N08/3186507302/
http://clarksoutdoorchairs.com/log_adirondack_chairs.html
http://www.flickr.com/photos/dboo/3314299591/
Master Service Failures. A master machine should not be a single point of failure! Masters keep two kinds of information. Configuration information (node/flow configuration): kept in a ZooKeeper ensemble for a persistent, highly available metadata store; failures are easily recovered from. Ephemeral information (heartbeat info, acks, metrics reports): kept in memory; failures will lose this data, but it can be lazily replicated.
Dealing with Agent failures. We do not want to lose data, so make events durable at the generation point. If a log generator goes down, it is not generating logs; if the event generation point fails and recovers, data will reach the end point. Data is durable and survives machine crashes and reboots. This allows for synchronous writes in log-generating applications. A watchdog program restarts the agent if it fails.