Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flows and IoT apps using Apache NiFi - Dhruv Kumar, Senior Solutions Architect - Hortonworks
This document discusses Apache NiFi and stream processing. It provides an overview of NiFi's key concepts: managing data flow, data provenance, and securing data. NiFi lets users build data flows visually with drag-and-drop processors. It offers features such as guaranteed delivery, data buffering, prioritized queuing, and data provenance. NiFi is based on Flow-Based Programming and is used to reliably transfer data between systems, enrich and prepare data, and deliver it to analytic platforms.
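To make the Flow-Based Programming idea concrete, here is a minimal conceptual sketch: independent processors exchange records only through bounded queues, so each stage can be swapped or re-routed without touching the others. This is an illustration of the idea only, not NiFi's actual API; all processor and queue names here are hypothetical.

```python
# Conceptual Flow-Based Programming sketch (hypothetical names; not NiFi's API).
# Processors are independent "black boxes" connected only by bounded queues.
from queue import Queue

def generate(out_q):
    """Source processor: emits raw records ("FlowFiles") into its outbound queue."""
    for i in range(5):
        out_q.put({"id": i, "payload": f"event-{i}"})
    out_q.put(None)  # end-of-stream marker for this sketch

def enrich(in_q, out_q):
    """Transform processor: annotates each record and passes it downstream."""
    while (record := in_q.get()) is not None:
        record["enriched"] = True
        out_q.put(record)
    out_q.put(None)

def deliver(in_q):
    """Sink processor: hands records to the destination system (printed here)."""
    while (record := in_q.get()) is not None:
        print("delivered:", record)

if __name__ == "__main__":
    q1, q2 = Queue(maxsize=10), Queue(maxsize=10)  # bounded queues provide back-pressure
    generate(q1)
    enrich(q1, q2)
    deliver(q2)
```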
23. In a nutshell…
[Architecture diagram: streaming sources (raw network stream, network metadata stream, syslog, raw application logs, other streaming telemetry) flow through NiFi into Hadoop (HDFS, HBase, Hive, SOLR on YARN), Storm, and Spark, and out to data stores, SIEM, and service management / workflow tooling.]
24. Key Tenets of Lambda Architecture
Batch Layer
- Manages the master data: an immutable, append-only set of raw data
- Cleanse, normalize & pre-compute batch views
- Advanced statistical calculations
Speed Layer
- Real-time event stream processing
- Computes real-time views
Serving Layer
- Low-latency, ad-hoc query
- Reporting, BI & dashboards
[Lambda architecture diagram: new data streams into both the BATCH LAYER (store master data, pre-compute views) and the SPEED LAYER (process streams into incremental views); both feed business views in the SERVING LAYER, which answers queries.]
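A toy sketch of how those three layers cooperate, using in-memory Python structures in place of the HDP/HDF components named on the slide; the data and function names are illustrative assumptions only.

```python
# Toy Lambda architecture sketch (illustrative only; a real deployment would use
# HDFS/Hive for the batch layer, Storm/Spark for the speed layer, and
# HBase/SOLR for the serving layer, as shown on the slide).
from collections import Counter

master_data = []           # batch layer: immutable, append-only raw events
batch_view = Counter()     # pre-computed batch view (event counts per user)
realtime_view = Counter()  # speed layer: incremental view of recent events

def recompute_batch_view():
    """Batch layer: periodically recompute the view from all master data."""
    global batch_view
    batch_view = Counter(e["user"] for e in master_data)

def handle_new_event(event):
    """New data goes to both layers: appended to master data, and applied
    incrementally to the real-time view."""
    master_data.append(event)
    realtime_view[event["user"]] += 1

def query(user):
    """Serving layer: merge batch and real-time views for a low-latency answer."""
    return batch_view[user] + realtime_view[user]

if __name__ == "__main__":
    for u in ["alice", "bob", "alice"]:
        handle_new_event({"user": u})
    # After a batch run, the real-time view for the absorbed events is reset.
    recompute_batch_view()
    realtime_view.clear()
    handle_new_event({"user": "alice"})
    print(query("alice"))  # 2 from the batch view + 1 from the speed layer = 3
```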
HDP and HDF
Fundamental Principles of Streaming Architectures
Introduce Flow Based Programming fundamentals, why they matter, and how NiFi adopts them
Introduce the architecture of NiFi, describe major system components, and describe the single node and clustering models.
For each component, describe its available (and potential) deployment models (relate it to Hadoop).
HDF Powered by Apache NiFi Addresses Modern Data Flow Challenges
- HDF provides 3 key capabilities: the ability to collect data from different types of data sources via a highly secure, lightweight agent; the ability to mediate the data flow to/from the data source and the "collector"; and the ability to trace, parse, and transform data in motion to enable analytics and derive insights within an operationally relevant time window.
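A hedged sketch of the first capability, a lightweight edge agent that tails a local log and forwards batches to a central collector. The endpoint URL, file path, and agent name are hypothetical, and a real HDF deployment would use a MiNiFi agent talking to NiFi rather than raw HTTP.

```python
# Sketch of a lightweight edge collection agent (hypothetical endpoint and path;
# a real HDF deployment would use MiNiFi/NiFi site-to-site, not plain HTTP).
import json
import time
import urllib.request

COLLECTOR_URL = "https://collector.example.com/ingest"  # hypothetical
LOG_PATH = "/var/log/app/app.log"                        # hypothetical

def send_batch(lines):
    """Forward a batch of log lines to the central collector over HTTPS."""
    body = json.dumps({"source": "edge-01", "lines": lines}).encode("utf-8")
    req = urllib.request.Request(COLLECTOR_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)

def tail_and_forward(batch_size=100):
    """Tail the local log file and forward new lines in small batches."""
    with open(LOG_PATH, "r") as f:
        f.seek(0, 2)  # start at the end of the file
        batch = []
        while True:
            line = f.readline()
            if not line:
                if batch:
                    send_batch(batch)
                    batch = []
                time.sleep(1)
                continue
            batch.append(line.rstrip("\n"))
            if len(batch) >= batch_size:
                send_batch(batch)
                batch = []
```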
Systems fail
Networks fail, disks fail, software crashes, people make mistakes.
Data access exceeds capacity to consume
Sometimes a given data source can outpace some part of the processing or delivery chain; it only takes one weak link to cause an issue (see the back-pressure sketch after this list of challenges).
Boundary conditions are mere suggestions
You will invariably get data that is too big, too small, too fast, too slow, corrupt, wrong, or in the wrong format.
What is noise one day becomes signal the next
Priorities of an organization change - rapidly. Enabling new flows and changing existing ones must be fast.
Systems evolve at different rates
The protocols and formats used by a given system can change anytime and often irrespective of the systems around them. Dataflow exists to connect what is essentially a massively distributed system of components that are loosely or not-at-all designed to work together.
Compliance and security
Laws, regulations, and policies change. Business to business agreements change. System to system and system to user interactions must be secure, trusted, accountable.
Continuous improvement occurs in production
It is often not possible to come even close to replicating production environments in the lab.
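Two of the challenges above, a source outpacing its consumer and priorities shifting overnight, are what NiFi's back-pressure and prioritized queuing are meant to absorb. A minimal conceptual sketch of those two ideas (not NiFi's implementation; in NiFi these are configured per connection in the flow):

```python
# Conceptual sketch of back-pressure and prioritized queuing (not NiFi's code).
import heapq
from queue import Queue, Full

# Back-pressure: a bounded queue refuses new data once the downstream consumer
# falls behind, so pressure propagates back toward the source.
connection = Queue(maxsize=3)
for i in range(5):
    try:
        connection.put_nowait(f"record-{i}")
    except Full:
        print(f"back-pressure applied, source must slow down at record-{i}")

# Prioritized queuing: when priorities change, higher-priority data is
# delivered first without rebuilding the flow.
priority_queue = []
heapq.heappush(priority_queue, (2, "routine telemetry"))
heapq.heappush(priority_queue, (1, "security alert"))  # lower number = higher priority
heapq.heappush(priority_queue, (3, "debug logs"))
while priority_queue:
    _, item = heapq.heappop(priority_queue)
    print("deliver:", item)
```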
TALK TRACK
Here are just a few of the modern data apps that convert yesterday's impossible challenges into today's new products, cures, conveniences, and life-saving innovations.
These apps are either custom-built by our customers, or they come off the shelf, created by Hortonworks or one of our ecosystem partners to solve a particular problem.
Symantec and other cyber security leaders have built powerful apps to detect threats to digital information.
Leading pharma, automotive, consumer electronics and packaged goods companies are building their factories of the future that use actionable intelligence to improve manufacturing yields.
And age-old industries like automotive, agriculture and retail are taking connected data platforms on the road, through the field or to the cash register to do things that have never before been possible.
[NEXT SLIDE]
Tiered processing framework: it is often not necessary to centralize everything back to the data center. Processing can happen in regional offices as well as on edge devices, for efficiency (fraud-detection logic defined in branch offices, etc.).
Bi-directional communication: real-time analytical results can be pushed back to the edge to adjust flow behavior accordingly. Examples: prioritize data collection based on real-time bandwidth (calculated in the DC with Flink jobs); for fraud detection, send triggering events back to the edge to block transactions in real time.
Data prioritization: prioritize the data flow itself; for example, higher-priority data can be sent back via LTE while lower-priority data waits until wifi becomes available (see the sketch after these notes).
Interactive vs. design/deploy: in the data center, complex flows get interactive command and control, letting users fix the pipes without shutting off the water; data flows are designed with a visual interface in the DC and pushed to multiple MiNiFi agents with one click (also providing a centralized place to version-control the flows on all agents).
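A small sketch of the edge-side data-prioritization note above: high-priority records go out immediately over the metered link (LTE), while bulk data is buffered until an unmetered link (wifi) is available. The record contents and link detector are illustrative assumptions.

```python
# Sketch of edge-side data prioritization (illustrative only).
from collections import deque

def current_link():
    """Hypothetical link detector; a real agent would query the device/network."""
    return "lte"

high_priority = deque([{"type": "fraud_alert", "txn": 42}])
low_priority = deque([{"type": "bulk_metrics", "count": 1000}])

def flush(send):
    link = current_link()
    while high_priority:            # always worth the metered link
        send(high_priority.popleft(), via=link)
    if link == "wifi":              # bulk data waits for a cheap link
        while low_priority:
            send(low_priority.popleft(), via=link)

if __name__ == "__main__":
    flush(lambda record, via: print(f"sent over {via}: {record}"))
```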
CapOne – Ingesting from everywhere
Email, Syslog, Applog, Netflow…
Moving to a "cloud-only" model… even looking to use Docker containers in Amazon…
Roll forward a few years, and Hadoop today provides a complete platform to address the batch, serving, and speed layers of the Lambda Architecture.
The team puts together a detailed architecture for the proposed solution using HDP and HDF. The architecture ingests data from numerous sources, including server logs, application logs, XML, and sensor data. This data is easily accepted into the flexible schema of HDP using HDF and Sqoop. The data is processed using Pig and analyzed using Spark. The results are then made available in a real-time dashboard as well as to visualization and reporting tools.
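A hedged sketch of the Spark analysis step in that architecture, aggregating ingested server logs into a table behind the dashboard. The paths, column names, and table name are assumptions for illustration; the deck only states that Spark is used for analysis.

```python
# Illustrative PySpark job for the analysis step (paths, columns, and table
# name are assumptions; the deck only says Spark performs the analysis).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dashboard-aggregates").getOrCreate()

# Server/application logs landed in HDFS by the HDF (NiFi) flow.
logs = spark.read.json("hdfs:///data/raw/app_logs/")  # hypothetical path

# Hourly error counts per service, assuming a timestamp and level column exist;
# this feeds the real-time dashboard's backing table.
hourly_errors = (
    logs.filter(F.col("level") == "ERROR")
        .groupBy(F.window("timestamp", "1 hour"), "service")
        .count()
)

hourly_errors.write.mode("overwrite").saveAsTable("dashboard.hourly_errors")
```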
[NEXT SLIDE]