Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flows and IoT apps using Apache NiFi - Dhruv Kumar, Senior Solutions Architect - Hortonworks
This document discusses Apache NiFi and stream processing. It provides an overview of NiFi's key concepts: managing data flow, data provenance, and securing data. NiFi lets users build data flows visually with drag-and-drop processors. It offers features such as guaranteed delivery, data buffering, prioritized queuing, and data provenance. NiFi is based on Flow-Based Programming and is used to reliably transfer data between systems, enrich and prepare data, and deliver it to analytic platforms.
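To make the Flow-Based Programming idea concrete, here is a minimal conceptual sketch: independent processors exchange records only through bounded queues, so each stage can be swapped or re-routed without touching the others. This is an illustration of the idea only, not NiFi's actual API; all processor and queue names here are hypothetical.

```python
# Conceptual Flow-Based Programming sketch (hypothetical names; not NiFi's API).
# Processors are independent "black boxes" connected only by bounded queues.
from queue import Queue

def generate(out_q):
    """Source processor: emits raw records ("FlowFiles") into its outbound queue."""
    for i in range(5):
        out_q.put({"id": i, "payload": f"event-{i}"})
    out_q.put(None)  # end-of-stream marker for this sketch

def enrich(in_q, out_q):
    """Transform processor: annotates each record and passes it downstream."""
    while (record := in_q.get()) is not None:
        record["enriched"] = True
        out_q.put(record)
    out_q.put(None)

def deliver(in_q):
    """Sink processor: hands records to the destination system (printed here)."""
    while (record := in_q.get()) is not None:
        print("delivered:", record)

if __name__ == "__main__":
    q1, q2 = Queue(maxsize=10), Queue(maxsize=10)  # bounded queues provide back-pressure
    generate(q1)
    enrich(q1, q2)
    deliver(q2)
```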
23. In a nutshell…
[Architecture diagram: streaming sources (raw network stream, network metadata stream, syslog, raw application logs, other streaming telemetry) flow through NiFi into Hadoop (HDFS, HBase, Hive, SOLR on YARN), Storm, and Spark, and out to data stores, SIEM, and service management / workflow tooling.]
24. Key Tenets of Lambda Architecture
Batch Layer
- Manages the master data: an immutable, append-only set of raw data
- Cleanse, normalize & pre-compute batch views
- Advanced statistical calculations
Speed Layer
- Real-time event stream processing
- Computes real-time views
Serving Layer
- Low-latency, ad-hoc query
- Reporting, BI & dashboards
[Lambda architecture diagram: new data streams into both the BATCH LAYER (store master data, pre-compute views) and the SPEED LAYER (process streams into incremental views); both feed business views in the SERVING LAYER, which answers queries.]
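A toy sketch of how those three layers cooperate, using in-memory Python structures in place of the HDP/HDF components named on the slide; the data and function names are illustrative assumptions only.

```python
# Toy Lambda architecture sketch (illustrative only; a real deployment would use
# HDFS/Hive for the batch layer, Storm/Spark for the speed layer, and
# HBase/SOLR for the serving layer, as shown on the slide).
from collections import Counter

master_data = []           # batch layer: immutable, append-only raw events
batch_view = Counter()     # pre-computed batch view (event counts per user)
realtime_view = Counter()  # speed layer: incremental view of recent events

def recompute_batch_view():
    """Batch layer: periodically recompute the view from all master data."""
    global batch_view
    batch_view = Counter(e["user"] for e in master_data)

def handle_new_event(event):
    """New data goes to both layers: appended to master data, and applied
    incrementally to the real-time view."""
    master_data.append(event)
    realtime_view[event["user"]] += 1

def query(user):
    """Serving layer: merge batch and real-time views for a low-latency answer."""
    return batch_view[user] + realtime_view[user]

if __name__ == "__main__":
    for u in ["alice", "bob", "alice"]:
        handle_new_event({"user": u})
    # After a batch run, the real-time view for the absorbed events is reset.
    recompute_batch_view()
    realtime_view.clear()
    handle_new_event({"user": "alice"})
    print(query("alice"))  # 2 from the batch view + 1 from the speed layer = 3
```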
HDP and HDF
Fundamental Principles of Streaming Architectures
Introduce Flow Based Programming fundamentals, why they matter, and how NiFi adopts them
Introduce the architecture of NiFi, describe major system components, and describe the single node and clustering models.
For each component, describe its available (and potential) deployment models (relate it to Hadoop).
HDF Powered by Apache NiFi Addresses Modern Data Flow Challenges
- HDF provides 3 key capabilities: the ability to collect data from different types of data sources via a highly secure, lightweight agent; the ability to mediate the data flow to/from the data source and the "collector"; and the ability to trace, parse, and transform data in motion to enable analytics and derive insights within an operationally relevant time window.
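A hedged sketch of the first capability, a lightweight edge agent that tails a local log and forwards batches to a central collector. The endpoint URL, file path, and agent name are hypothetical, and a real HDF deployment would use a MiNiFi agent talking to NiFi rather than raw HTTP.

```python
# Sketch of a lightweight edge collection agent (hypothetical endpoint and path;
# a real HDF deployment would use MiNiFi/NiFi site-to-site, not plain HTTP).
import json
import time
import urllib.request

COLLECTOR_URL = "https://collector.example.com/ingest"  # hypothetical
LOG_PATH = "/var/log/app/app.log"                        # hypothetical

def send_batch(lines):
    """Forward a batch of log lines to the central collector over HTTPS."""
    body = json.dumps({"source": "edge-01", "lines": lines}).encode("utf-8")
    req = urllib.request.Request(COLLECTOR_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)

def tail_and_forward(batch_size=100):
    """Tail the local log file and forward new lines in small batches."""
    with open(LOG_PATH, "r") as f:
        f.seek(0, 2)  # start at the end of the file
        batch = []
        while True:
            line = f.readline()
            if not line:
                if batch:
                    send_batch(batch)
                    batch = []
                time.sleep(1)
                continue
            batch.append(line.rstrip("\n"))
            if len(batch) >= batch_size:
                send_batch(batch)
                batch = []
```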
Systems fail
Networks fail, disks fail, software crashes, people make mistakes.
Data access exceeds capacity to consume
Sometimes a given data source can outpace some part of the processing or delivery chain; it only takes one weak link to cause an issue (see the back-pressure sketch after this list of challenges).
Boundary conditions are mere suggestions
You will invariably get data that is too big, too small, too fast, too slow, corrupt, wrong, or in the wrong format.
What is noise one day becomes signal the next
Priorities of an organization change - rapidly. Enabling new flows and changing existing ones must be fast.
Systems evolve at different rates
The protocols and formats used by a given system can change anytime and often irrespective of the systems around them. Dataflow exists to connect what is essentially a massively distributed system of components that are loosely or not-at-all designed to work together.
Compliance and security
Laws, regulations, and policies change. Business to business agreements change. System to system and system to user interactions must be secure, trusted, accountable.
Continuous improvement occurs in production
It is often not possible to come even close to replicating production environments in the lab.
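Two of the challenges above, a source outpacing its consumer and priorities shifting overnight, are what NiFi's back-pressure and prioritized queuing are meant to absorb. A minimal conceptual sketch of those two ideas (not NiFi's implementation; in NiFi these are configured per connection in the flow):

```python
# Conceptual sketch of back-pressure and prioritized queuing (not NiFi's code).
import heapq
from queue import Queue, Full

# Back-pressure: a bounded queue refuses new data once the downstream consumer
# falls behind, so pressure propagates back toward the source.
connection = Queue(maxsize=3)
for i in range(5):
    try:
        connection.put_nowait(f"record-{i}")
    except Full:
        print(f"back-pressure applied, source must slow down at record-{i}")

# Prioritized queuing: when priorities change, higher-priority data is
# delivered first without rebuilding the flow.
priority_queue = []
heapq.heappush(priority_queue, (2, "routine telemetry"))
heapq.heappush(priority_queue, (1, "security alert"))  # lower number = higher priority
heapq.heappush(priority_queue, (3, "debug logs"))
while priority_queue:
    _, item = heapq.heappop(priority_queue)
    print("deliver:", item)
```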
TALK TRACK
Here are just a few of the modern data apps that convert yesterday's impossible challenges into today's new products, cures, conveniences, and life-saving innovations.
These apps are either custom-built by our customers, or they come off the shelf, created by Hortonworks or one of our ecosystem partners to solve a particular problem.
Symantec and other cyber security leaders have built powerful apps to detect threats to digital information.
Leading pharma, automotive, consumer electronics and packaged goods companies are building their factories of the future that use actionable intelligence to improve manufacturing yields.
And age-old industries like automotive, agriculture and retail are taking connected data platforms on the road, through the field or to the cash register to do things that have never before been possible.
[NEXT SLIDE]
Tiered processing framework: it is often not necessary to centralize everything back to the data center. Processing can happen in regional offices as well as on edge devices, for efficiency (fraud-detection logic defined in branch offices, etc.).
Bi-directional communication: real-time analytical results can be pushed back to the edge to adjust flow behavior accordingly. Examples: prioritize data collection based on real-time bandwidth (calculated in the DC with Flink jobs); for fraud detection, send triggering events back to the edge to block transactions in real time.
Data prioritization: prioritize the data flow itself; for example, higher-priority data can be sent back via LTE while lower-priority data waits until wifi becomes available (see the sketch after these notes).
Interactive vs. design/deploy: in the data center, complex flows get interactive command and control, letting users fix the pipes without shutting off the water; data flows are designed with a visual interface in the DC and pushed to multiple MiNiFi agents with one click (also providing a centralized place to version-control the flows on all agents).
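A small sketch of the edge-side data-prioritization note above: high-priority records go out immediately over the metered link (LTE), while bulk data is buffered until an unmetered link (wifi) is available. The record contents and link detector are illustrative assumptions.

```python
# Sketch of edge-side data prioritization (illustrative only).
from collections import deque

def current_link():
    """Hypothetical link detector; a real agent would query the device/network."""
    return "lte"

high_priority = deque([{"type": "fraud_alert", "txn": 42}])
low_priority = deque([{"type": "bulk_metrics", "count": 1000}])

def flush(send):
    link = current_link()
    while high_priority:            # always worth the metered link
        send(high_priority.popleft(), via=link)
    if link == "wifi":              # bulk data waits for a cheap link
        while low_priority:
            send(low_priority.popleft(), via=link)

if __name__ == "__main__":
    flush(lambda record, via: print(f"sent over {via}: {record}"))
```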
CapOne – Ingesting from everywhere
Email, Syslog, Applog, Netflow…
Moving to a "cloud-only" model… even looking to use Docker containers in Amazon…
Roll forward a few years, and Hadoop today provides a complete platform to address the batch, serving, and speed layers of the Lambda Architecture.
The team puts together a detailed architecture for the proposed solution using HDP and HDF. The architecture ingests data from numerous sources, including server logs, application logs, XML, and sensor data. This data is easily accepted into the flexible schema of HDP using HDF and Sqoop. The data is processed using Pig and analyzed using Spark. The results are then made available in a real-time dashboard as well as to visualization and reporting tools.
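A hedged sketch of the Spark analysis step in that architecture, aggregating ingested server logs into a table behind the dashboard. The paths, column names, and table name are assumptions for illustration; the deck only states that Spark is used for analysis.

```python
# Illustrative PySpark job for the analysis step (paths, columns, and table
# name are assumptions; the deck only says Spark performs the analysis).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dashboard-aggregates").getOrCreate()

# Server/application logs landed in HDFS by the HDF (NiFi) flow.
logs = spark.read.json("hdfs:///data/raw/app_logs/")  # hypothetical path

# Hourly error counts per service, assuming a timestamp and level column exist;
# this feeds the real-time dashboard's backing table.
hourly_errors = (
    logs.filter(F.col("level") == "ERROR")
        .groupBy(F.window("timestamp", "1 hour"), "service")
        .count()
)

hourly_errors.write.mode("overwrite").saveAsTable("dashboard.hourly_errors")
```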
[NEXT SLIDE]