1© Cloudera, Inc. All rights reserved.
Hari Shreedharan, Software Engineer @ Cloudera
Committer/PMC Member, Apache Flume
Committer, Apache Sqoop
Contributor, Apache Spark
Author, Using Flume (O’Reilly)
Flume + Spark Streaming =
Ingest + Processing
2© Cloudera, Inc. All rights reserved.
What is Flume
• Collection, Aggregation of streaming Event Data
• Typically used for log data
• Significant advantages over ad-hoc solutions
• Reliable, Scalable, Manageable, Customizable and High Performance
• Declarative, Dynamic Configuration
• Contextual Routing
• Feature rich
• Fully extensible
3© Cloudera, Inc. All rights reserved.
Core Concepts: Event
An Event is the fundamental unit of data transported by Flume from its
point of origination to its final destination. Event is a byte array payload
accompanied by optional headers.
• Payload is opaque to Flume
• Headers are specified as an unordered collection of string key-value pairs, with
keys being unique across the collection
• Headers can be used for contextual routing
4© Cloudera, Inc. All rights reserved.
Core Concepts: Client
An entity that generates events and sends them to one or more Agents.
• Example
• Flume log4j Appender
• Custom Client using Client SDK (org.apache.flume.api)
• Embedded Agent – An agent embedded within your application
• Decouples Flume from the system where the event data originates
• Not needed in all cases
5© Cloudera, Inc. All rights reserved.
Core Concepts: Agent
A container for hosting Sources, Channels, Sinks and other components
that enable the transportation of events from one place to another.
• Fundamental part of a Flume flow
• Provides Configuration, Life-Cycle Management, and Monitoring Support
for hosted components
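To make the declarative configuration concrete, here is a minimal sketch of an agent definition as a properties file (the agent and component names are illustrative):

```properties
# One source, one channel, one sink, hosted by agent "agent1"
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Netcat source listening on a TCP port
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# In-memory buffer between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Logger sink, useful for demonstration
agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1
```

Note that a source lists channels (it may fan out to several), while a sink takes exactly one channel.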
6© Cloudera, Inc. All rights reserved.
Inside a Flume agent
7© Cloudera, Inc. All rights reserved.
Typical Aggregation Flow
[Client]+ → Agent [→ Agent]* → Destination
8© Cloudera, Inc. All rights reserved.
Core Concepts: Source
An active component that receives events from a specialized location or
mechanism and places them on one or more Channels.
• Different Source types:
• Specialized sources for integrating with well-known systems. Example:
Syslog, Netcat
• Auto-Generating Sources: Exec, SEQ
• IPC sources for Agent-to-Agent communication: Avro
• Require at least one channel to function
9© Cloudera, Inc. All rights reserved.
Sources
• Different Source types:
• Specialized sources for integrating with well-known systems. Example:
Spooling Files, Syslog, Netcat, JMS
• Auto-Generating Sources: Exec, SEQ
• IPC sources for Agent-to-Agent communication: Avro, Thrift
• Require at least one channel to function
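As a sketch, a spooling-directory source watching a directory for completed log files could be configured like this (the source name and path are illustrative):

```properties
agent1.sources = spool1
agent1.sources.spool1.type = spooldir
agent1.sources.spool1.spoolDir = /var/log/flume-spool
agent1.sources.spool1.channels = ch1
```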
10© Cloudera, Inc. All rights reserved.
Core Concepts: Channel
A passive component that buffers the incoming events until they are drained by
Sinks.
• Different Channels offer different levels of persistence:
• Memory Channel: volatile
• Data lost if JVM or machine restarts
• File Channel: backed by WAL implementation
• Data not lost unless the disk itself fails.
• When the agent restarts, the buffered data is available again.
• Channels are fully transactional
• Provide weak ordering guarantees
• Can work with any number of Sources and Sinks.
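A File Channel trades some throughput for durability. A minimal sketch (directories and capacity are illustrative):

```properties
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /flume/checkpoint
agent1.channels.ch1.dataDirs = /flume/data
agent1.channels.ch1.capacity = 1000000
```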
11© Cloudera, Inc. All rights reserved.
Core Concepts: Sink
An active component that removes events from a Channel and transmits
them to their next hop destination.
• Different types of Sinks:
• Terminal sinks that deposit events to their final destination. For example:
HDFS, HBase, Morphline-Solr, Elastic Search
• Sinks support serialization to user’s preferred formats.
• HDFS sink supports time-based and arbitrary bucketing of data while writing
to HDFS.
• IPC sink for Agent-to-Agent communication: Avro, Thrift
• Require exactly one channel to function
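For example, time-based bucketing in the HDFS sink is expressed through escape sequences in the path (values here are illustrative; the %Y/%m/%d/%H escapes require a timestamp header on each event):

```properties
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.channel = ch1
agent1.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/events/%Y/%m/%d/%H
agent1.sinks.hdfs1.hdfs.fileType = DataStream
agent1.sinks.hdfs1.hdfs.rollInterval = 300
```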
12© Cloudera, Inc. All rights reserved.
Flow Reliability
Normal Flow
Communication Failure between Agents
Communication Restored, Flow back to Normal
13© Cloudera, Inc. All rights reserved.
Flow Handling
Channels decouple impedance of upstream and downstream
• Upstream burstiness is damped by channels
• Downstream failures are transparently absorbed by channels
• Sizing of channel capacity is key to realizing these benefits
14© Cloudera, Inc. All rights reserved.
Interceptors
Interceptor
An Interceptor is a component applied to a source in pre-specified order
to enable decorating and filtering of events where necessary.
• Built-in Interceptors allow adding headers such as timestamps, hostname,
static markers etc.
• Custom interceptors can introspect event payload to create specific headers
where necessary
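For instance, attaching the built-in timestamp and host interceptors to a source takes a few lines of configuration (the source name is illustrative):

```properties
agent1.sources.src1.interceptors = ts host
agent1.sources.src1.interceptors.ts.type = timestamp
agent1.sources.src1.interceptors.host.type = host
```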
15© Cloudera, Inc. All rights reserved.
Contextual Routing
Channel Selector
A Channel Selector allows a Source to select one or more Channels from
all the Channels that the Source is configured with based on preset
criteria.
• Built-in Channel Selectors:
• Replicating: for duplicating the events
• Multiplexing: for routing based on headers
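A multiplexing selector routing on a header value might be sketched as follows (the header name and values are illustrative):

```properties
agent1.sources.src1.channels = ch1 ch2
agent1.sources.src1.selector.type = multiplexing
agent1.sources.src1.selector.header = datacenter
agent1.sources.src1.selector.mapping.NYC = ch1
agent1.sources.src1.selector.mapping.SF = ch2
agent1.sources.src1.selector.default = ch1
```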
16© Cloudera, Inc. All rights reserved.
Contextual Routing
• Terminal Sinks can directly use Headers to make destination selections
• HDFS Sink can use header values to create dynamic paths for the files that
events will be added to.
• Some headers such as timestamps can be used in a more sophisticated
manner
• Custom Channel Selector can be used for doing specialized routing where
necessary
17© Cloudera, Inc. All rights reserved.
Client API
• Simple API that can be used to send data to Flume agents.
• Simplest form – send a batch of events to one agent.
• Can be used to send data to multiple agents in a round-robin, random or
failover fashion (send data to one till it fails).
• Java only.
• flume.thrift can be used to generate code for other languages.
• Use with Thrift source.
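A minimal sketch of the Client API in use (the host and port are illustrative; the default client speaks Avro):

```scala
import java.nio.charset.StandardCharsets
import org.apache.flume.api.RpcClientFactory
import org.apache.flume.event.EventBuilder

// Connect to an agent's Avro source
val client = RpcClientFactory.getDefaultInstance("collector-host", 41414)
try {
  val event = EventBuilder.withBody("hello flume", StandardCharsets.UTF_8)
  client.append(event)   // single event; appendBatch(...) for a batch
} finally {
  client.close()
}
```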
18© Cloudera, Inc. All rights reserved.
Motivation for Real-Time Stream Processing
Data is being created at unprecedented rates
• Exponential data growth from mobile, web, social
• Connected devices: 9B in 2012 to 50B by 2020
• Over 1 trillion sensors by 2020
• Datacenter IP traffic growing at CAGR of 25%
How can we harness this data in real-time?
• Value can quickly degrade → capture value immediately
• From reactive analysis to direct operational impact
• Unlocks new competitive advantages
• Requires a completely new approach...
19© Cloudera, Inc. All rights reserved.
Spark: Easy and Fast Big Data
• Easy to Develop
• Rich APIs in Java, Scala, Python
• Interactive shell
• Fast to Run
• General execution graphs
• In-memory storage
2-5× less code
Up to 10× faster on disk, 100× in memory
20© Cloudera, Inc. All rights reserved.
Spark Architecture
[Diagram: a central Driver coordinating several Workers, each holding Data in RAM]
21© Cloudera, Inc. All rights reserved.
RDDs
RDD = Resilient Distributed Datasets
• Immutable representation of data
• Operations on one RDD create a new one
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
• Lazy materialization
Two observations:
a. Can fall back to disk when data-set does not fit in memory
b. Provides fault-tolerance through concept of lineage
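These properties can be seen in a small sketch (assumes a SparkContext sc; the path is illustrative):

```scala
// Transformations build new RDDs lazily; nothing runs until an action
val lines  = sc.textFile("hdfs:///logs")         // RDD over stable storage
val errors = lines.filter(_.contains("ERROR"))   // new RDD, not yet computed
errors.cache()                                   // keep in memory, fall back to disk if it does not fit
val n = errors.count()                           // action: materializes the lineage
```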
22© Cloudera, Inc. All rights reserved.
Spark Streaming
Extension of Apache Spark’s Core API, for Stream Processing.
The Framework Provides
Fault Tolerance
Scalability
High-Throughput
23© Cloudera, Inc. All rights reserved.
Canonical Stream Processing Architecture
[Diagram: Data Sources → Flume/Kafka (Data Ingest) → Kafka → App 1, App 2, … → HDFS / HBase]
24© Cloudera, Inc. All rights reserved.
Spark Streaming
• Incoming data represented as Discretized Streams (DStreams)
• Stream is broken down into micro-batches
• Each micro-batch is an RDD – can share code between batch and streaming
25© Cloudera, Inc. All rights reserved.
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
[Diagram: tweets DStream → flatMap → hashTags DStream → save, repeated for batch @ t, t+1, t+2]
Stream composed of small (1-10s) batch computations
“Micro-batch” Architecture
26© Cloudera, Inc. All rights reserved.
Use DStreams for Windowing Functions
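A sketch of a windowed computation, reusing the hashTags DStream from the previous slide (the window and slide durations are illustrative):

```scala
import org.apache.spark.streaming.{Minutes, Seconds}

// Count hashtags over the last 10 minutes, updated every 10 seconds
val counts = hashTags
  .window(Minutes(10), Seconds(10))
  .countByValue()
counts.print()
```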
27© Cloudera, Inc. All rights reserved.
Spark Streaming Use-Cases
• Real-time dashboards
• Show approximate results in real-time
• Reconcile periodically with source-of-truth using Spark
• Joins of multiple streams
• Time-based or count-based “windows”
• Combine multiple sources of input to produce composite data
• Re-use RDDs created by Streaming in other Spark jobs.
28© Cloudera, Inc. All rights reserved.
Connectors
• Flume
• Kafka
• Amazon Kinesis
• MQTT
• Twitter (?!!)
29© Cloudera, Inc. All rights reserved.
Flume Polling Input DStream
• Reliable Receiver to pull data from Flume agents.
• Configure Flume agents to run the SparkSink
(org.apache.spark.streaming.flume.sink.SparkSink)
• Use FlumeUtils.createPollingStream to create a
FlumePollingInputDStream configured to connect to the
Flume agent(s)
• Enable WAL in Spark configuration to ensure no data loss
• Depends on Flume’s transactions for reliability
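Putting the pieces together, a sketch of the receiving side (the host, port, and agent names are illustrative; the agent side must be configured with the SparkSink):

```scala
import org.apache.spark.streaming.flume.FlumeUtils

// Agent side (flume.conf):
//   agent1.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
//   agent1.sinks.spark.hostname = 0.0.0.0
//   agent1.sinks.spark.port = 9999

val flumeStream = FlumeUtils.createPollingStream(ssc, "flume-agent-host", 9999)
flumeStream.map(e => new String(e.event.getBody.array())).print()
```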
30© Cloudera, Inc. All rights reserved.
Flume Connector
• The Flume Spark Sink and the Flume receiver in Spark communicate over Avro,
using the standard Flume event format.
• No additional costs for using Spark Streaming vs just sending data from
one Flume agent to another!
• With WAL enabled, no data loss!
31© Cloudera, Inc. All rights reserved.
Flume Transactions
• Sink starts a transaction with the channel
[Diagram: Flume Agent (Flume Channel → Spark Sink) ↔ Executor (Receiver → Block Manager); StartTxn]
32© Cloudera, Inc. All rights reserved.
Flume Transactions
• Sink starts a transaction with the channel
• Sink then pulls events out and sends them to the receiver
[Diagram: Flume Agent (Flume Channel → Spark Sink) ↔ Executor (Receiver → Block Manager); StartTxn, Event Stream]
33© Cloudera, Inc. All rights reserved.
Flume Transactions
• Sink starts a transaction with the channel
• Sink then pulls events out and sends them to the receiver
• The receiver stores all events in a single
store call
[Diagram: Flume Agent (Flume Channel → Spark Sink) ↔ Executor (Receiver → Block Manager: Block); StartTxn, Event Stream]
34© Cloudera, Inc. All rights reserved.
Flume Transactions
• Sink starts a transaction with the channel
• Sink then pulls events out and sends them to the receiver
• The receiver stores all events in a single
store call
• Once the store call returns, the receiver
sends an ACK to the sink
[Diagram: Flume Agent (Flume Channel → Spark Sink) ↔ Executor (Receiver → Block Manager: Block); StartTxn, Event Stream, ACK]
35© Cloudera, Inc. All rights reserved.
Flume Transactions
• Sink starts a transaction with the channel
• Sink then pulls events out and sends them to the receiver
• The receiver stores all events in a single store call
• Once the store call returns, the receiver sends an ACK to the sink
• The sink then commits the transaction
• Timeouts/NACKs trigger a rollback and data is re-sent
[Diagram: Flume Agent (Flume Channel → Spark Sink) ↔ Executor (Receiver → Block Manager: Block); StartTxn, Event Stream, ACK, CommitTxn]
36© Cloudera, Inc. All rights reserved.
Summary
• Clients send Events to Agents
• Agents host a number of Flume components – Sources, Interceptors, Channel
Selectors, Channels, Sink Processors, and Sinks
• Sources and Sinks are active components, whereas Channels are passive
• Source accepts Events, passes them through Interceptor(s), and, if they are not
filtered out, puts them on the channel(s) selected by the configured Channel Selector
• Sink Processor identifies a Sink to invoke, which takes Events from a Channel
and sends them to the next-hop destination
• Channel operations are transactional to guarantee one-hop delivery semantics
• Channel persistence allows for ensuring end-to-end reliability
37© Cloudera, Inc. All rights reserved.
Thank you
hshreedharan@cloudera.com
@harisr1234

More Related Content

What's hot

Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
DataWorks Summit
 

What's hot (20)

Apache REEF - stdlib for big data
Apache REEF - stdlib for big dataApache REEF - stdlib for big data
Apache REEF - stdlib for big data
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
Spark streaming + kafka 0.10
Spark streaming + kafka 0.10Spark streaming + kafka 0.10
Spark streaming + kafka 0.10
 
Meet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaMeet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + Kafka
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 
Building large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twillBuilding large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twill
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
 
Harnessing the power of YARN with Apache Twill
Harnessing the power of YARN with Apache TwillHarnessing the power of YARN with Apache Twill
Harnessing the power of YARN with Apache Twill
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Stream Processing using Apache Spark and Apache Kafka
Stream Processing using Apache Spark and Apache KafkaStream Processing using Apache Spark and Apache Kafka
Stream Processing using Apache Spark and Apache Kafka
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Big data: Loading your data with flume and sqoop
Big data:  Loading your data with flume and sqoopBig data:  Loading your data with flume and sqoop
Big data: Loading your data with flume and sqoop
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
 
Spark streaming and Kafka
Spark streaming and KafkaSpark streaming and Kafka
Spark streaming and Kafka
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
 

Viewers also liked

ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016
Jayesh Thakrar
 
Semicolon 2013 presentation
Semicolon 2013 presentationSemicolon 2013 presentation
Semicolon 2013 presentation
Pandurang Kamat
 
Flume with Twitter Integration
Flume with Twitter IntegrationFlume with Twitter Integration
Flume with Twitter Integration
RockyCIce
 

Viewers also liked (20)

Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Apache flume - an Introduction
Apache flume - an IntroductionApache flume - an Introduction
Apache flume - an Introduction
 
Gobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for HadoopGobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for Hadoop
 
ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 
Streamsets and spark
Streamsets and sparkStreamsets and spark
Streamsets and spark
 
Semicolon 2013 presentation
Semicolon 2013 presentationSemicolon 2013 presentation
Semicolon 2013 presentation
 
Flume with Twitter Integration
Flume with Twitter IntegrationFlume with Twitter Integration
Flume with Twitter Integration
 
Big data prototyping in AWS cloud
Big data prototyping in AWS cloudBig data prototyping in AWS cloud
Big data prototyping in AWS cloud
 
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
 
Avvo fkafka
Avvo fkafkaAvvo fkafka
Avvo fkafka
 
The sensor data challenge - Innovations (not only) for the Internet of Things
The sensor data challenge - Innovations (not only) for the Internet of ThingsThe sensor data challenge - Innovations (not only) for the Internet of Things
The sensor data challenge - Innovations (not only) for the Internet of Things
 
Marvin, Data Science & Spark – haben wir ohne Mathematik und Technik noch ein...
Marvin, Data Science & Spark – haben wir ohne Mathematik und Technik noch ein...Marvin, Data Science & Spark – haben wir ohne Mathematik und Technik noch ein...
Marvin, Data Science & Spark – haben wir ohne Mathematik und Technik noch ein...
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark
 

Similar to Spark+flume seattle

Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table Notes
Timothy Spann
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Timothy Spann
 
OSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming AppsOSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming Apps
Timothy Spann
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Cloudera, Inc.
 

Similar to Spark+flume seattle (20)

End to End Streaming Architectures
End to End Streaming ArchitecturesEnd to End Streaming Architectures
End to End Streaming Architectures
 
PCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningPCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System Tuning
 
Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table Notes
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
 
OSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming AppsOSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming Apps
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Cloudera's Flume
Cloudera's FlumeCloudera's Flume
Cloudera's Flume
 
Flume lspe-110325145754-phpapp01
Flume lspe-110325145754-phpapp01Flume lspe-110325145754-phpapp01
Flume lspe-110325145754-phpapp01
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
IoT Austin CUG talk
IoT Austin CUG talkIoT Austin CUG talk
IoT Austin CUG talk
 
JDD 2016 - Jacek Bukowski - "Flying To Clouds" - Can It Be Easy?
JDD 2016 - Jacek Bukowski - "Flying To Clouds" - Can It Be Easy?JDD 2016 - Jacek Bukowski - "Flying To Clouds" - Can It Be Easy?
JDD 2016 - Jacek Bukowski - "Flying To Clouds" - Can It Be Easy?
 
Flying to clouds - can it be easy? Cloud Native Applications
Flying to clouds - can it be easy? Cloud Native ApplicationsFlying to clouds - can it be easy? Cloud Native Applications
Flying to clouds - can it be easy? Cloud Native Applications
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
 
Maximizing Real-Time Data Processing with Apache Kafka and InfluxDB: A Compre...
Maximizing Real-Time Data Processing with Apache Kafka and InfluxDB: A Compre...Maximizing Real-Time Data Processing with Apache Kafka and InfluxDB: A Compre...
Maximizing Real-Time Data Processing with Apache Kafka and InfluxDB: A Compre...
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 

Recently uploaded

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Recently uploaded (20)

ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 

Spark+flume seattle

  • 1. 1© Cloudera, Inc. All rights reserved. Hari Shreedharan, Software Engineer @ Cloudera Committer/PMC Member, Apache Flume Committer, Apache Sqoop Contributor, Apache Spark Author, Using Flume (O’Reilly) Flume + Spark Streaming = Ingest + Processing
  • 2. 2© Cloudera, Inc. All rights reserved. What is Flume • Collection, Aggregation of streaming Event Data • Typically used for log data • Significant advantages over ad-hoc solutions • Reliable, Scalable, Manageable, Customizable and High Performance • Declarative, Dynamic Configuration • Contextual Routing • Feature rich • Fully extensible
  • 3. 3© Cloudera, Inc. All rights reserved. Core Concepts: Event An Event is the fundamental unit of data transported by Flume from its point of origination to its final destination. Event is a byte array payload accompanied by optional headers. • Payload is opaque to Flume • Headers are specified as an unordered collection of string key-value pairs, with keys being unique across the collection • Headers can be used for contextual routing
  • 4. 4© Cloudera, Inc. All rights reserved. Core Concepts: Client An entity that generates events and sends them to one or more Agents. • Example • Flume log4j Appender • Custom Client using Client SDK (org.apache.flume.api) • Embedded Agent – An agent embedded within your application • Decouples Flume from the system where event data is consumed from • Not needed in all cases
  • 5. 5© Cloudera, Inc. All rights reserved. Core Concepts: Agent A container for hosting Sources, Channels, Sinks and other components that enable the transportation of events from one place to another. • Fundamental part of a Flume flow • Provides Configuration, Life-Cycle Management, and Monitoring Support for hosted components
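The declarative configuration mentioned above is a Java properties file. A minimal single-agent sketch (the agent name `a1`, port, and capacity are illustrative) wiring a Netcat source to a logger sink through a memory channel:

```properties
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Netcat source: listens on a TCP port, one event per line of input
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# In-memory channel buffering up to 10000 events
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Logger sink: writes events to the agent's log (useful for testing)
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Started with `flume-ng agent --conf-file example.conf --name a1`, the agent hosts all three components and manages their configuration, life cycle, and monitoring.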
  • 6. 6© Cloudera, Inc. All rights reserved. Inside a Flume agent
  • 7. 7© Cloudera, Inc. All rights reserved. Typical Aggregation Flow [Client]+ → Agent [→ Agent]* → Destination
  • 8. 8© Cloudera, Inc. All rights reserved. Core Concepts: Source An active component that receives events from a specialized location or mechanism and places them on one or more Channels. • Different Source types: • Specialized sources for integrating with well-known systems. Example: Syslog, Netcat • Auto-Generating Sources: Exec, SEQ • IPC sources for Agent-to-Agent communication: Avro • Require at least one channel to function
  • 9. 9© Cloudera, Inc. All rights reserved. Sources • Different Source types: • Specialized sources for integrating with well-known systems. Example: Spooling Files, Syslog, Netcat, JMS • Auto-Generating Sources: Exec, SEQ • IPC sources for Agent-to-Agent communication: Avro, Thrift • Require at least one channel to function
  • 10. 10© Cloudera, Inc. All rights reserved. Core Concepts: Channel A passive component that buffers the incoming events until they are drained by Sinks. • Different Channels offer different levels of persistence: • Memory Channel: volatile • Data lost if JVM or machine restarts • File Channel: backed by WAL implementation • Data not lost unless the disk dies. • Eventually, when the agent comes back data can be accessed. • Channels are fully transactional • Provide weak ordering guarantees • Can work with any number of Sources and Sinks.
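Choosing between the persistence levels above is a configuration change on the channel; a File Channel sketch (paths and capacities are illustrative):

```properties
# Durable channel backed by a WAL: events survive agent/JVM restarts
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
# Total events the channel may hold, and the per-transaction batch limit
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000
```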
  • 11. 11© Cloudera, Inc. All rights reserved. Core Concepts: Sink An active component that removes events from a Channel and transmits them to their next hop destination. • Different types of Sinks: • Terminal sinks that deposit events to their final destination. For example: HDFS, HBase, Morphline-Solr, Elastic Search • Sinks support serialization to user’s preferred formats. • HDFS sink supports time-based and arbitrary bucketing of data while writing to HDFS. • IPC sink for Agent-to-Agent communication: Avro, Thrift • Require exactly one channel to function
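Time-based bucketing on the HDFS sink is driven by escape sequences in the path. A sketch (path and rounding values are illustrative; the `%Y/%m/%d/%H` escapes require a timestamp header on each event, e.g. from the timestamp interceptor):

```properties
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# Events are bucketed into a directory per hour...
a1.sinks.k1.hdfs.path = /flume/events/%Y/%m/%d/%H
# ...with timestamps rounded down to 10-minute boundaries
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Write raw event bodies rather than SequenceFiles
a1.sinks.k1.hdfs.fileType = DataStream
```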
  • 12. 12© Cloudera, Inc. All rights reserved. Flow Reliability Normal Flow Communication Failure between Agents Communication Restored, Flow back to Normal
  • 13. 13© Cloudera, Inc. All rights reserved. Flow Handling Channels decouple impedance of upstream and downstream • Upstream burstiness is damped by channels • Downstream failures are transparently absorbed by channels  Sizing of channel capacity is key in realizing these benefits
  • 14. 14© Cloudera, Inc. All rights reserved. Interceptors Interceptor An Interceptor is a component applied to a source in pre-specified order to enable decorating and filtering of events where necessary. • Built-in Interceptors allow adding headers such as timestamps, hostname, static markers etc. • Custom interceptors can introspect event payload to create specific headers where necessary
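Interceptors are declared on the source as an ordered, space-separated list. A sketch of the built-ins mentioned above (the names `i1`-`i3` and the static key/value are illustrative):

```properties
# Applied in order: timestamp, then host, then a static marker
a1.sources.r1.interceptors = i1 i2 i3
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i3.type = static
a1.sources.r1.interceptors.i3.key = datacenter
a1.sources.r1.interceptors.i3.value = NYC
```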
  • 15. 15© Cloudera, Inc. All rights reserved. Contextual Routing Channel Selector A Channel Selector allows a Source to select one or more Channels from all the Channels that the Source is configured with based on preset criteria. • Built-in Channel Selectors: • Replicating: for duplicating the events • Multiplexing: for routing based on headers
  • 16. 16© Cloudera, Inc. All rights reserved. Contextual Routing • Terminal Sinks can directly use Headers to make destination selections • HDFS Sink can use headers values to create dynamic path for files that the event will be added to. • Some headers such as timestamps can be used in a more sophisticated manner • Custom Channel Selector can be used for doing specialized routing where necessary
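Putting the routing pieces together, a multiplexing selector can route on a header that an interceptor (or the client) set. A sketch assuming a `datacenter` header and two channels `c1`/`c2`:

```properties
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
# Route on the value of the "datacenter" header
a1.sources.r1.selector.header = datacenter
a1.sources.r1.selector.mapping.NYC = c1
a1.sources.r1.selector.mapping.SFO = c2
# Events with no matching mapping fall through to c1
a1.sources.r1.selector.default = c1
```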
  • 17. 17© Cloudera, Inc. All rights reserved. Client API • Simple API that can be used to send data to Flume agents. • Simplest form – send a batch of events to one agent. • Can be used to send data to multiple agents in a round-robin, random or failover fashion (send data to one till it fails). • Java only. • flume.thrift can be used to generate code for other languages. • Use with Thrift source.
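In the real SDK the entry point is `org.apache.flume.api.RpcClientFactory` and events are built with `EventBuilder`. The failover behavior described above ("send data to one till it fails") can be sketched with a tiny self-contained stand-in; the class below is illustrative, not the Flume API:

```java
import java.util.List;

// Illustrative stand-in for the host-selection logic of a failover RPC
// client: keep sending to the current agent until a send fails, then
// move on to the next host in the list (wrapping around).
public class FailoverSelector {
    private final List<String> hosts;
    private int current = 0;

    public FailoverSelector(List<String> hosts) {
        this.hosts = hosts;
    }

    // Host the next batch of events would be sent to.
    public String activeHost() {
        return hosts.get(current);
    }

    // Called when an append/appendBatch to the active host fails.
    public void markFailed() {
        current = (current + 1) % hosts.size();
    }

    public static void main(String[] args) {
        FailoverSelector s =
            new FailoverSelector(List.of("agent1:41414", "agent2:41414"));
        System.out.println(s.activeHost()); // agent1:41414
        s.markFailed();                     // agent1 unreachable: fail over
        System.out.println(s.activeHost()); // agent2:41414
    }
}
```

A round-robin client advances after every batch instead of only on failure; the random strategy simply picks a host per batch.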
  • 18. 18© Cloudera, Inc. All rights reserved. Motivation for Real-Time Stream Processing Data is being created at unprecedented rates • Exponential data growth from mobile, web, social • Connected devices: 9B in 2012 to 50B by 2020 • Over 1 trillion sensors by 2020 • Datacenter IP traffic growing at CAGR of 25% How can we harness this data in real-time? • Value can quickly degrade → capture value immediately • From reactive analysis to direct operational impact • Unlocks new competitive advantages • Requires a completely new approach...
  • 19. 19© Cloudera, Inc. All rights reserved. Spark: Easy and Fast Big Data •Easy to Develop •Rich APIs in Java, Scala, Python •Interactive shell •Fast to Run •General execution graphs •In-memory storage 2-5× less code Up to 10× faster on disk, 100× in memory
  • 20. 20© Cloudera, Inc. All rights reserved. Spark Architecture [Diagram: a Driver coordinating multiple Workers, each holding Data in RAM]
  • 21. 21© Cloudera, Inc. All rights reserved. RDDs RDD = Resilient Distributed Datasets • Immutable representation of data • Operations on one RDD creates a new one • Memory caching layer that stores data in a distributed, fault-tolerant cache • Created by parallel transformations on data in stable storage • Lazy materialization Two observations: a. Can fall back to disk when data-set does not fit in memory b. Provides fault-tolerance through concept of lineage
  • 22. 22© Cloudera, Inc. All rights reserved. Spark Streaming Extension of Apache Spark’s Core API, for Stream Processing. The framework provides: Fault Tolerance, Scalability, High Throughput
  • 23. 23© Cloudera, Inc. All rights reserved. Canonical Stream Processing Architecture [Diagram: Data Sources feed ingest via Flume and Kafka; App 1, App 2, ... consume and write to Kafka, HDFS, HBase]
  • 24. 24© Cloudera, Inc. All rights reserved. Spark Streaming • Incoming data represented as Discretized Streams (DStreams) • Stream is broken down into micro-batches • Each micro-batch is an RDD – can share code between batch and streaming
  • 25. 25© Cloudera, Inc. All rights reserved. “Micro-batch” Architecture val tweets = ssc.twitterStream() val hashTags = tweets.flatMap(status => getTags(status)) hashTags.saveAsHadoopFiles("hdfs://...") [Diagram: tweets DStream → flatMap → hashTags DStream → save, repeated for batch @ t, t+1, t+2] Stream composed of small (1-10s) batch computations
  • 26. 26© Cloudera, Inc. All rights reserved. Use DStreams for Windowing Functions
  • 27. 27© Cloudera, Inc. All rights reserved. Spark Streaming Use-Cases • Real-time dashboards • Show approximate results in real-time • Reconcile periodically with source-of-truth using Spark • Joins of multiple streams • Time-based or count-based “windows” • Combine multiple sources of input to produce composite data • Re-use RDDs created by Streaming in other Spark jobs.
  • 28. 28© Cloudera, Inc. All rights reserved. Connectors • Flume • Kafka • Amazon Kinesis • MQTT • Twitter (?!!)
  • 29. 29© Cloudera, Inc. All rights reserved. Flume Polling Input DStream • Reliable Receiver to pull data from Flume agents. • Configure Flume agents to run the SparkSink (org.apache.spark.streaming.flume.sink.SparkSink) • Use FlumeUtils.createPollingStream to create a FlumePollingInputDStream configured to connect to the Flume agent(s) • Enable WAL in Spark configuration to ensure no data loss • Depends on Flume’s transactions for reliability
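On the Flume side, the SparkSink is configured like any other sink (the agent/channel names and port below are illustrative); the spark-streaming-flume-sink jar and its dependencies must be on the agent's classpath:

```properties
a1.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
# Address the Spark receiver will poll
a1.sinks.spark.hostname = 0.0.0.0
a1.sinks.spark.port = 9999
a1.sinks.spark.channel = c1
```

On the Spark side, `FlumeUtils.createPollingStream(ssc, host, port)` then connects the receiver to this sink.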
  • 30. 30© Cloudera, Inc. All rights reserved. Flume Connector • The Flume sink and the Flume receiver on Spark communicate over Avro and use the standard Flume event format. • No additional cost for using Spark Streaming vs. just sending data from one Flume agent to another! • With WAL enabled, no data loss!
  • 31. 31© Cloudera, Inc. All rights reserved. Flume Transactions • Sink starts a transaction with the channel [Diagram: Flume Channel → Spark Sink (StartTxn) on the Flume Agent; Receiver → Block Manager on the Executor]
  • 32. 32© Cloudera, Inc. All rights reserved. Flume Transactions • Sink starts a transaction with the channel • Sink then pulls events out and sends them to the receiver
  • 33. 33© Cloudera, Inc. All rights reserved. Flume Transactions • Sink starts a transaction with the channel • Sink then pulls events out and sends them to the receiver • The receiver stores all events in a single store call
  • 34. 34© Cloudera, Inc. All rights reserved. Flume Transactions • Sink starts a transaction with the channel • Sink then pulls events out and sends them to the receiver • The receiver stores all events in a single store call • Once the store call returns, the receiver sends an ACK to the sink
  • 35. 35© Cloudera, Inc. All rights reserved. Flume Transactions • Sink starts a transaction with the channel • Sink then pulls events out and sends them to the receiver • The receiver stores all events in a single store call • Once the store call returns, the receiver sends an ACK to the sink • The sink then commits the transaction (CommitTxn) • Timeouts/NACKs trigger a rollback and data is re-sent.
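The five steps above amount to a take/commit/rollback protocol on the channel. A self-contained sketch of that protocol (this is a simulation for illustration, not Flume's real Channel/Transaction API):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simulates the sink-side transaction: take events from the channel,
// send them, then commit on ACK or roll back on NACK/timeout so the
// same events can be re-sent later.
public class ChannelTxnSketch {
    private final Deque<String> channel = new ArrayDeque<>();
    private final List<String> inFlight = new ArrayList<>();

    public void put(String event) {
        channel.addLast(event);
    }

    // StartTxn + take: move up to n events into the in-flight batch.
    public List<String> take(int n) {
        while (inFlight.size() < n && !channel.isEmpty()) {
            inFlight.add(channel.pollFirst());
        }
        return new ArrayList<>(inFlight);
    }

    // CommitTxn: the receiver ACKed, so the batch is gone for good.
    public void commit() {
        inFlight.clear();
    }

    // Rollback: NACK or timeout; events return to the head of the
    // channel in their original order, ready to be re-sent.
    public void rollback() {
        for (int i = inFlight.size() - 1; i >= 0; i--) {
            channel.addFirst(inFlight.get(i));
        }
        inFlight.clear();
    }

    public int size() {
        return channel.size();
    }
}
```

Because the batch either commits whole or returns to the channel whole, a crash between the store call and the ACK results in re-delivery, not loss (at-least-once semantics).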
  • 36. 36© Cloudera, Inc. All rights reserved. Summary • Clients send Events to Agents • Agents host a number of Flume components – Sources, Interceptors, Channel Selectors, Channels, Sink Processors, and Sinks • Sources and Sinks are active components, whereas Channels are passive • A Source accepts Events, passes them through Interceptor(s), and if not filtered, puts them on channel(s) selected by the configured Channel Selector • A Sink Processor identifies a sink to invoke, which takes Events from a Channel and sends them to its next-hop destination • Channel operations are transactional to guarantee one-hop delivery semantics • Channel persistence allows for ensuring end-to-end reliability
  • 37. 37© Cloudera, Inc. All rights reserved. Thank you hshreedharan@cloudera.com @harisr1234

Editor's Notes

  1. The Narrative: Vast quantities of streaming data are being generated, and more will be generated thanks to phenomena such as the Internet of Things. The motivation for Real-Time Stream processing is to turn all this data into valuable insights and actions, as soon as the data is generated. Instant processing of the data also opens the door to new use cases that were not possible before. NOTE: Feel free to remove the image of “The Flash” if it feels unprofessional or overly cheesy
  2. Purpose of this Slide: Make sure to associate Spark Streaming with Apache Spark, so folks know it is a part of THE Apache Spark that everyone is talking about. List some of the key properties that make Spark Streaming a good platform for stream processing. Touch upon the key attributes that make it good for stream processing. Note: If required, we can mention low latency as well.