Real Time Data Ingest into Hadoop
using Flume
Hari Shreedharan
Committer and PMC Member, Apache Flume
Software Engineer, Cloudera
What is Flume
• Collection, Aggregation of streaming Event Data
• Typically used for log data
• Significant advantages over ad-hoc solutions
• Reliable, Scalable, Manageable, Customizable and High
Performance
• Declarative, Dynamic Configuration
• Contextual Routing
• Feature rich
• Fully extensible
2
Core Concepts: Event
An Event is the fundamental unit of data transported by Flume
from its point of origination to its final destination. Event is a
byte array payload accompanied by optional headers.
• Payload is opaque to Flume
• Headers are specified as an unordered collection of string key-
value pairs, with keys being unique across the collection
• Headers can be used for contextual routing
3
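For illustration, a minimal Java sketch of building such an Event with the Client SDK's EventBuilder; the header names and values ("datacenter", "NYC") are invented for the example:

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class EventSketch {
  public static void main(String[] args) {
    // Headers: an unordered collection of string key-value pairs with unique keys
    Map<String, String> headers = new HashMap<>();
    headers.put("timestamp", String.valueOf(System.currentTimeMillis()));
    headers.put("datacenter", "NYC"); // hypothetical header, usable for contextual routing

    // The payload itself is an opaque byte array as far as Flume is concerned
    Event event = EventBuilder.withBody("a log line", StandardCharsets.UTF_8, headers);
    System.out.println(event.getHeaders() + " / " + event.getBody().length + " bytes");
  }
}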
Core Concepts: Client
An entity that generates events and sends them to one or more Agents.
• Example
• Flume log4j Appender
• Custom Client using Client SDK (org.apache.flume.api)
• Embedded Agent – An agent embedded within your application
• Decouples Flume from the system from which event data is consumed
• Not needed in all cases
4
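As one concrete client example, the Flume log4j Appender can be wired through an application's log4j.properties. This is a sketch only, assuming the flume-ng-log4jappender jar is on the classpath and an Avro source is listening on the given host and port (both placeholders):

# log4j.properties (sketch)
log4j.rootLogger = INFO, flume
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = agent-host.example.com
log4j.appender.flume.Port = 4141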
Core Concepts: Agent
A container for hosting Sources, Channels, Sinks and other
components that enable the transportation of events from one
place to another.
• Fundamental part of a Flume flow
• Provides Configuration, Life-Cycle Management, and
Monitoring Support for hosted components
5
Typical Aggregation Flow
6
[Client]+ → Agent [→ Agent]* → Destination
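A rough sketch of how such a flow could be expressed in configuration – a first-tier agent forwarding over Avro to an aggregating agent. Names, hosts and ports are illustrative only, and each agent would normally get its own file:

# First-tier agent "agent" – tails a local log and forwards via Avro
agent.sources = local
agent.channels = mem
agent.sinks = fwd
agent.sources.local.type = exec
agent.sources.local.command = tail -F /var/log/app.log
agent.sources.local.channels = mem
agent.channels.mem.type = memory
agent.sinks.fwd.type = avro
agent.sinks.fwd.hostname = collector01.example.com
agent.sinks.fwd.port = 4141
agent.sinks.fwd.channel = mem

# Aggregating agent "collector" – receives Avro and writes to its destination
collector.sources = in
collector.channels = mem
collector.sinks = out
collector.sources.in.type = avro
collector.sources.in.bind = 0.0.0.0
collector.sources.in.port = 4141
collector.sources.in.channels = mem
collector.channels.mem.type = memory
collector.sinks.out.type = logger
collector.sinks.out.channel = mem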
Core Concepts: Source
An active component that receives events from a specialized
location or mechanism and places them on one or more Channels.
• Different Source types:
• Specialized sources for integrating with well-known systems.
Example: Syslog, Netcat
• Auto-Generating Sources: Exec, SEQ
• IPC sources for Agent-to-Agent communication: Avro
• Require at least one channel to function
7
Sources
• Different Source types:
• Specialized sources for integrating with well-known systems.
Example: Spooling Files, Syslog, Netcat, JMS
• Auto-Generating Sources: Exec, SEQ
• IPC sources for Agent-to-Agent communication: Avro, Thrift
• Require at least one channel to function
8
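For example, a Spooling Directory source watching a drop directory of completed log files might be configured like this; the directory and names are placeholders:

agent1.sources = spool
agent1.sources.spool.type = spooldir
agent1.sources.spool.spoolDir = /var/log/flume-spool
agent1.sources.spool.fileHeader = true
agent1.sources.spool.channels = myChannel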
Core Concepts: Channel
A passive component that buffers the incoming events until they are
drained by Sinks.
• Different Channels offer different levels of persistence:
• Memory Channel: volatile
• Data lost if JVM or machine restarts
• File Channel: backed by WAL implementation
• Data not lost unless the disk dies.
• Eventually, when the agent comes back data can be accessed.
• Channels are fully transactional
• Provide weak ordering guarantees
• Can work with any number of Sources and Sinks.
9
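A sketch of the two channel types side by side; capacities and paths are arbitrary values chosen for illustration:

# Memory Channel: volatile, in-JVM buffering
agent1.channels.memChannel.type = memory
agent1.channels.memChannel.capacity = 10000
agent1.channels.memChannel.transactionCapacity = 1000

# File Channel: WAL-backed, survives agent restarts
agent1.channels.fileChannel.type = file
agent1.channels.fileChannel.checkpointDir = /var/lib/flume/checkpoint
agent1.channels.fileChannel.dataDirs = /var/lib/flume/data
agent1.channels.fileChannel.capacity = 1000000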
Core Concepts: Sink
An active component that removes events from a Channel and
transmits them to their next hop destination.
• Different types of Sinks:
• Terminal sinks that deposit events to their final destination. For
example: HDFS, HBase, Morphline-Solr, Elastic Search
• Sinks support serialization to user’s preferred formats.
• HDFS sink supports time-based and arbitrary bucketing of data while
writing to HDFS.
• IPC sink for Agent-to-Agent communication: Avro, Thrift
• Require exactly one channel to function
10
Flow Reliability
Reliability based on:
• Transactional Exchange between Agents
• Persistence Characteristics of Channels in the Flow
Also Available:
• Built-in Load balancing Support
• Built-in Failover Support
11
Flow Reliability
Normal Flow
Communication Failure between Agents
Communication Restored, Flow back to Normal
12
Flow Handling
Channels decouple impedance of upstream and downstream
• Upstream burstiness is damped by channels
• Downstream failures are transparently absorbed by channels
• Sizing of channel capacity is key to realizing these benefits
13
Configuration
• Java Properties File Format
# Comment line
key1 = value
key2 = multi-line \
       value
• Hierarchical, Name Based Configuration
agent1.channels.myChannel.type = FILE
agent1.channels.myChannel.capacity = 1000
• Uses soft references for establishing associations
agent1.sources.mySource.type = HTTP
agent1.sources.mySource.channels = myChannel
14
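Putting the fragments above together, a complete (if minimal) single-agent configuration might look like the following; the agent name, component names and port are arbitrary:

agent1.sources = mySource
agent1.channels = myChannel
agent1.sinks = mySink

agent1.sources.mySource.type = netcat
agent1.sources.mySource.bind = 0.0.0.0
agent1.sources.mySource.port = 44444
agent1.sources.mySource.channels = myChannel

agent1.channels.myChannel.type = FILE
agent1.channels.myChannel.capacity = 1000

agent1.sinks.mySink.type = logger
agent1.sinks.mySink.channel = myChannel

An agent is typically started against such a file with something like: flume-ng agent --name agent1 --conf-file flume.conf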
Configuration
Typical Deployment
• All agents in a specific tier could be given the same name
• One configuration file with entries for three agents can be used throughout
15
Contextual Routing
Achieved using Interceptors and Channel Selectors
16
Interceptors
Interceptor
An Interceptor is a component applied to a source in pre-specified order
to enable decorating and filtering of events where necessary.
• Built-in Interceptors allow adding headers such as timestamps,
hostname, static markers etc.
• Custom interceptors can introspect event payload to create specific
headers where necessary
17
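A sketch of chaining built-in interceptors on a source – timestamp, host and a static marker; the header key and value used by the static interceptor ("datacenter" = "NYC") are invented for the example:

agent1.sources.mySource.interceptors = ts host tag
agent1.sources.mySource.interceptors.ts.type = timestamp
agent1.sources.mySource.interceptors.host.type = host
agent1.sources.mySource.interceptors.host.hostHeader = hostname
agent1.sources.mySource.interceptors.tag.type = static
agent1.sources.mySource.interceptors.tag.key = datacenter
agent1.sources.mySource.interceptors.tag.value = NYC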
Contextual Routing
Channel Selector
A Channel Selector allows a Source to select one or more
Channels from all the Channels that the Source is configured with
based on preset criteria.
• Built-in Channel Selectors:
• Replicating: for duplicating the events
• Multiplexing: for routing based on headers
18
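A multiplexing selector sketch that routes on a hypothetical "datacenter" header (for example one set by an interceptor upstream); the channel names are placeholders:

agent1.sources.mySource.channels = nycChannel sfoChannel otherChannel
agent1.sources.mySource.selector.type = multiplexing
agent1.sources.mySource.selector.header = datacenter
agent1.sources.mySource.selector.mapping.NYC = nycChannel
agent1.sources.mySource.selector.mapping.SFO = sfoChannel
agent1.sources.mySource.selector.default = otherChannel

Replicating is the default selector, so it needs no explicit configuration.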
Contextual Routing
• Terminal Sinks can directly use Headers to make destination
selections
• HDFS Sink can use header values to create a dynamic path for the file
that an event will be written to (see the sketch after this list)
• Some headers such as timestamps can be used in a more
sophisticated manner
• Custom Channel Selector can be used for doing specialized
routing where necessary
19
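One way the HDFS sink's bucketing could be configured, assuming events carry a timestamp header (e.g. from the timestamp interceptor) plus the hypothetical datacenter header used earlier; the namenode path is a placeholder:

agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.channel = myChannel
# %{...} expands a header value; %Y/%m/%d expand the event timestamp
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/%{datacenter}/%Y/%m/%d
agent1.sinks.hdfsSink.hdfs.filePrefix = events
agent1.sinks.hdfsSink.hdfs.fileType = DataStream
agent1.sinks.hdfsSink.hdfs.rollInterval = 300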
Load Balancing and Failover
Sink Processor
A Sink Processor is responsible for invoking one sink from an
assigned group of sinks.
• Built-in Sink Processors:
• Load Balancing Sink Processor – using RANDOM, ROUND_ROBIN
or Custom selection algorithm
• Failover Sink Processor
• Default Sink Processor
20
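A load-balancing group sketch over two (assumed) Avro sinks; the group and sink names are arbitrary:

agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = avroSink1 avroSink2
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.selector = round_robin
agent1.sinkgroups.g1.processor.backoff = true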
Sink Processor
• Invoked by Sink Runner
• Acts as a proxy for a Sink
21
Sink Processor Configuration
• A Sink can exist in at most one group
• A Sink that is not in any group is handled via Default Sink
Processor
• Caution:
Removing a Sink Group does not make the sinks inactive!
22
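Along the same lines, a failover group sketch (priorities are arbitrary); note that simply deleting the sinkgroups entries later leaves both sinks configured and running under the Default Sink Processor:

agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = primarySink backupSink
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.primarySink = 10
agent1.sinkgroups.g1.processor.priority.backupSink = 5
agent1.sinkgroups.g1.processor.maxpenalty = 10000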
Client API
• Simple API that can be used to send data to Flume agents.
• Simplest form – send a batch of events to one agent.
• Can be used to send data to multiple agents in a round-robin,
random or failover fashion (send data to one agent until it fails).
• Java only.
• flume.thrift can be used to generate code for other
languages.
• Use with Thrift source.
23
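A minimal Java sketch of the default (Avro) RPC client; the hostname and port are placeholders for an agent running an Avro source:

import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class RpcClientSketch {
  public static void main(String[] args) {
    // Connects to an agent's Avro source (host and port are assumptions)
    RpcClient client = RpcClientFactory.getDefaultInstance("agent-host.example.com", 4141);
    try {
      Event event = EventBuilder.withBody("hello flume", StandardCharsets.UTF_8);
      client.append(event); // throws EventDeliveryException if delivery fails
    } catch (EventDeliveryException e) {
      e.printStackTrace(); // caller decides whether to retry, buffer or drop
    } finally {
      client.close();
    }
  }
}

The round-robin, random and failover variants are obtained through other RpcClientFactory factory methods configured via a Properties object.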
Embedded Agent
• Client API throws exceptions if data could not be sent.
• Applications may not be able to tolerate this.
• Embedded Agent – A (limited) Flume agent within your application
• Has a channel – so it buffers data in memory or on disk until the data is
sent or the channel is full.
• Throws exceptions only if the channel is full (or error writing to
channel).
• Cushion for application if something causes data to be stuck within
the application
• Supports sending data to other Flume agents only, no HDFS, HBase
etc.
24
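A rough Java sketch of the Embedded Agent; the property keys follow the embedded agent's flat configuration convention, and the downstream host and port are assumptions:

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.apache.flume.agent.embedded.EmbeddedAgent;
import org.apache.flume.event.EventBuilder;

public class EmbeddedAgentSketch {
  public static void main(String[] args) throws Exception {
    Map<String, String> props = new HashMap<>();
    props.put("channel.type", "memory");   // or "file" for durable buffering
    props.put("channel.capacity", "10000");
    props.put("sinks", "sink1");
    props.put("sink1.type", "avro");       // embedded agents only send to other Flume agents
    props.put("sink1.hostname", "collector01.example.com");
    props.put("sink1.port", "4141");
    props.put("processor.type", "default");

    EmbeddedAgent agent = new EmbeddedAgent("appAgent");
    agent.configure(props);
    agent.start();
    // put() throws only if the event cannot be committed to the channel
    agent.put(EventBuilder.withBody("payload", StandardCharsets.UTF_8));
    agent.stop();
  }
}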
Summary
• Clients send Events to Agents
• Agents host a number of Flume components – Sources, Interceptors, Channel
Selectors, Channels, Sink Processors, and Sinks.
• Sources and Sinks are active components, whereas Channels are passive
• Source accepts Events, passes them through Interceptor(s), and if not
filtered, puts them on channel(s) selected by the configured Channel
Selector
• Sink Processor identifies a sink to invoke, which takes Events from a
Channel and sends them to the next hop destination
• Channel operations are transactional to guarantee one-hop delivery
semantics
• Channel persistence allows for ensuring end-to-end reliability
25
26
Questions?
