Data Distribution Patterns with Apache NiFi
March 2016
Bryan Bende – Member of Technical Staff
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Overview
• Frequently Asked Question:
“How do I distribute data across my NiFi cluster?”
• Answer:
“It depends…”
• The right pattern typically depends on whether the data is being pushed or pulled
Pushing to Listeners
[Diagram: three Data Producers send to a Load Balancer, which forwards to ListenHTTP on Node 1 and Node 2; NCM]
• A listening processor runs on each node in the cluster
• A load balancer sits in front of the cluster and points at the listeners
• Data producers push data to the load balancer, which distributes it across the cluster
• The same approach works for ListenSyslog, ListenUDP, and HandleHttpRequest
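The distribution step above can be sketched in a few lines. This is only an illustration, not how any particular load balancer is implemented: it assumes a simple round-robin policy, and the listener URLs (host names, port, and path) are hypothetical placeholders.

```python
from collections import Counter
from itertools import cycle

# Hypothetical listener endpoints, one per NiFi node (hosts, port, and path are assumptions).
LISTENERS = [
    "http://node1:8081/contentListener",
    "http://node2:8081/contentListener",
]


def distribute(messages, listeners):
    """Assign each message to a listener in round-robin order."""
    targets = cycle(listeners)
    return [(next(targets), message) for message in messages]


# Six events from the data producers are spread evenly over the two listeners.
assignments = distribute([f"event-{i}" for i in range(6)], LISTENERS)
per_node = Counter(target for target, _ in assignments)
```

A real load balancer may use a different policy (least connections, hashing, health checks); the point is only that the producers see one address while the listeners share the traffic.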
Pulling on All Nodes
[Diagram: GetKafka on Node 1 and Node 2 both pull from one Kafka Topic; NCM]
• Works when the data source ensures each request gets a unique piece of data
• Kafka sees every GetKafka processor as the same client
• As a result, each GetKafka processor gets different data
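The property this pattern relies on, that every pull returns data no other node receives, can be modelled with a shared queue. The sketch below is deliberately simplified and is not Kafka's protocol (real Kafka divides a topic's partitions among the consumers in a group); the node names and message values are made up for illustration.

```python
import queue
import threading


def consume(node_name, topic, received):
    """Drain messages from the shared topic until it is empty."""
    while True:
        try:
            message = topic.get_nowait()
        except queue.Empty:
            return
        received[node_name].append(message)


# The shared source: every get removes the message, so no two nodes see the same one.
topic = queue.Queue()
for i in range(100):
    topic.put(f"message-{i}")

received = {"node1": [], "node2": []}
workers = [
    threading.Thread(target=consume, args=(name, topic, received))
    for name in received
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

all_messages = received["node1"] + received["node2"]
```

How the messages split between the two nodes varies from run to run, but every message is delivered exactly once across the cluster.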
Pulling With List & Fetch Operations
[Diagram: Node 1 (Primary) and Node 2, each with ListHDFS, RPG, Input Port, and FetchHDFS; HDFS; NCM]
• Perform the “list” operation on the primary node only
• Send the results to a Remote Process Group pointing back to the same cluster
• The Remote Process Group redistributes the results to all nodes, which perform the “fetch” in parallel
• The same approach works for ListFile + FetchFile and ListSFTP + FetchSFTP
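The three steps above (list once, redistribute, fetch in parallel) can be sketched as follows. Everything here is a stand-in: the functions are hypothetical and touch no real HDFS, the file paths are invented, and the round-robin split is an assumption about how results might be spread, not a description of what the Remote Process Group actually does.

```python
from concurrent.futures import ThreadPoolExecutor

NODES = ["node1", "node2"]  # hypothetical cluster nodes


def list_files():
    """Stand-in for the 'list' step, run only on the primary node."""
    return [f"/data/file-{i}.txt" for i in range(6)]


def redistribute(paths, nodes):
    """Stand-in for the Remote Process Group: spread listing results across nodes."""
    plan = {node: [] for node in nodes}
    for index, path in enumerate(paths):
        plan[nodes[index % len(nodes)]].append(path)
    return plan


def fetch(node, paths):
    """Stand-in for the 'fetch' step; returns placeholder content for each path."""
    return [(node, path, f"contents of {path}") for path in paths]


plan = redistribute(list_files(), NODES)

# Each node fetches its own share at the same time.
with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
    results = list(pool.map(lambda item: fetch(*item), plan.items()))

fetched = [record for batch in results for record in batch]
```

Listing is cheap and must happen once to avoid duplicates, while fetching is the expensive part, which is why only the fetch is spread across the cluster.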
Site-To-Site Push
[Diagram: an RPG on a Standalone NiFi pushes to Input Ports on Node 1 and Node 2; NCM]
• Site-To-Site makes a direct connection between NiFi instances
• Either side can be a cluster or a standalone instance
• In a push scenario, the source connects a Remote Process Group to an Input Port on the destination
• When pushing to a cluster, Site-To-Site takes care of load balancing across the nodes in the cluster
Site-To-Site Pull
[Diagram: RPGs on Node 1 and Node 2 pull from an Output Port on a Standalone NiFi; NCM]
• In a pull scenario, the destination connects a Remote Process Group to an Output Port on the source
• Each destination node pulls different data from the source
• If the source were a cluster, each destination node would pull from every node in the source cluster
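The cluster-to-cluster case in the last bullet can be pictured with a small model. This is a sketch under stated assumptions, not the Site-To-Site protocol: the node names, the queue contents, and the "take one item per source per cycle" rule are all invented for illustration.

```python
from collections import deque

# Hypothetical source cluster: each source node holds its own output-port queue.
source_queues = {
    "source1": deque(f"s1-flowfile-{i}" for i in range(4)),
    "source2": deque(f"s2-flowfile-{i}" for i in range(4)),
}
DESTINATION_NODES = ["dest1", "dest2"]


def pull_round(dest_node, sources, pulled):
    """One pull cycle: the destination node takes at most one item from each source node."""
    for source_queue in sources.values():
        if source_queue:
            pulled[dest_node].append(source_queue.popleft())


pulled = {node: [] for node in DESTINATION_NODES}

# Destination nodes take turns until every source queue is drained.
while any(source_queues.values()):
    for node in DESTINATION_NODES:
        pull_round(node, source_queues, pulled)
```

Because each pull removes the item from the source queue, the destination nodes never receive the same data, and each of them ends up with data from both source nodes.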
Summary
Several mechanisms for distributing data across a cluster:
• Push to Listeners with a load balancer in front
• Pull from a source that provides a queue
• Pull with a List + Fetch approach
• Push with Site-to-Site
• Pull with Site-to-Site
Contact Info
• bbende@hortonworks.com
• Twitter - @bbende