© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi
• Created to address the challenges of global enterprise dataflow
• Key features:
– Visual Command and Control
– Data Lineage (Provenance)
– Data Prioritization
– Data Buffering/Back-Pressure
– Control Latency vs. Throughput
– Secure Control Plane / Data Plane
– Scale Out Clustering
– Extensibility
NiFi Core Concepts
FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of entirely new components simply by composition of other components.
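The terms in the table can be illustrated with a minimal sketch. It is not NiFi's API: the class and function names (`FlowFile`, `Connection`, `uppercase_processor`) and the `max_queued` threshold are assumptions chosen for illustration, and the back-pressure rule is simplified to "stop moving data when the downstream queue is full".

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """One object moving through the flow: content plus attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Connection:
    """Bounded queue between two processors (illustrative, not NiFi's class)."""

    def __init__(self, max_queued: int):
        self.max_queued = max_queued  # hypothetical back-pressure threshold
        self._queue = deque()

    def is_full(self) -> bool:
        return len(self._queue) >= self.max_queued

    def offer(self, flowfile: FlowFile) -> bool:
        """Enqueue the FlowFile unless the threshold has been reached."""
        if self.is_full():
            return False
        self._queue.append(flowfile)
        return True

    def poll(self):
        """Return the oldest queued FlowFile, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None


def uppercase_processor(source: Connection, destination: Connection) -> int:
    """Transform and move FlowFiles until the source is empty or the destination is full."""
    moved = 0
    while not destination.is_full():
        flowfile = source.poll()
        if flowfile is None:
            break
        destination.offer(FlowFile(flowfile.content.upper(), dict(flowfile.attributes)))
        moved += 1
    return moved
```

Because the two queues are independent, the upstream and downstream sides can run at different rates; anything the processor does not move simply stays queued in `source`.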
Architecture - Standalone
[Diagram: an OS/Host running a JVM that contains the Flow Controller, the Web Server, and Processor 1 ... Extension N; the FlowFile Repository, Content Repository, and Provenance Repository sit on Local Storage]
• FlowFile Repository
– Write Ahead Log
– State of every FlowFile
– Pointers to content repository (pass-by-reference)
• Content Repository
– FlowFile content
– Copy-on-write
• Provenance Repository
– Write Ahead Log + Lucene Indexes
– Store & search lineage events
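The pass-by-reference and copy-on-write ideas can be sketched in a few lines. This is a simplified in-memory model, not NiFi's implementation: the class names, the integer claim ids, and the list standing in for the write-ahead log are all assumptions for illustration.

```python
class ContentRepository:
    """Store of immutable content blobs, each addressed by a claim id."""

    def __init__(self):
        self._claims = {}
        self._next_id = 0

    def write(self, data: bytes) -> int:
        """Store the data under a fresh claim id; existing claims are never modified."""
        claim_id = self._next_id
        self._next_id += 1
        self._claims[claim_id] = bytes(data)
        return claim_id

    def read(self, claim_id: int) -> bytes:
        return self._claims[claim_id]


class FlowFileRepository:
    """Tracks each FlowFile's attributes and a pointer (claim id) to its content."""

    def __init__(self, content_repo: ContentRepository):
        self.content_repo = content_repo
        self.log = []    # stand-in for the write-ahead log
        self.state = {}  # FlowFile id -> {"attributes": ..., "claim": ...}

    def _record(self, flowfile_id, attributes, claim_id):
        # Append to the log first, then update the current state.
        self.log.append((flowfile_id, dict(attributes), claim_id))
        self.state[flowfile_id] = {"attributes": dict(attributes), "claim": claim_id}

    def create(self, flowfile_id, data, attributes=None):
        self._record(flowfile_id, attributes or {}, self.content_repo.write(data))

    def clone(self, source_id, new_id):
        # Pass-by-reference: the clone points at the same claim, no bytes are copied.
        source = self.state[source_id]
        self._record(new_id, source["attributes"], source["claim"])

    def update_content(self, flowfile_id, new_data):
        # Copy-on-write: new content gets a new claim; the old claim stays untouched.
        attributes = self.state[flowfile_id]["attributes"]
        self._record(flowfile_id, attributes, self.content_repo.write(new_data))

    def content_of(self, flowfile_id) -> bytes:
        return self.content_repo.read(self.state[flowfile_id]["claim"])
```

Cloning a FlowFile and then modifying the clone leaves the original's content readable under its original claim, which is the property the copy-on-write bullet describes.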
Architecture - Cluster
[Diagram: three nodes, each an OS/Host running a JVM with the Flow Controller, the Web Server, Processor 1 ... Extension N, and FlowFile, Content, and Provenance Repositories on Local Storage; the nodes coordinate through ZooKeeper]
• Same dataflow on each node, data partitioned across cluster
• Access the UI from any node
• ZooKeeper for auto-election of Cluster Coordinator & Primary Node
• Cluster Coordinator receives heartbeats from other nodes, manages joining/disconnecting
• Primary Node for scheduling processors on a single node
GetSolr
• Incrementally extract new documents
• Main query is *:*; the Solr Query property is an optional filter query
• Date Field used as filter query, starting from the last execution or an initial value
• Sorted by date field and unique key
• Cursor mark used behind the scenes
• Specify return fields, or all if blank
• Output Solr XML, or Records
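The bullets above map onto standard Solr request parameters (`q`, `fq`, `fl`, `sort`, `rows`, `cursorMark`). The sketch below shows one way such a request could be assembled and paged; it is not GetSolr's source code. The function names, the exclusive lower bound on the date range, and the `search` callable with its flattened `docs` / `nextCursorMark` response are assumptions for illustration.

```python
def build_params(date_field, unique_key, last_date, solr_query=None,
                 return_fields=None, rows=100, cursor_mark="*"):
    """Assemble Solr request parameters for an incremental extract."""
    # Only documents newer than the last execution (exclusive lower bound, assumed).
    filter_queries = [f"{date_field}:{{{last_date} TO *]"]
    if solr_query:
        filter_queries.append(solr_query)  # optional user-supplied filter query
    params = {
        "q": "*:*",
        "fq": filter_queries,
        # Cursor paging needs a sort that includes the unique key as tie-breaker.
        "sort": f"{date_field} asc, {unique_key} asc",
        "rows": rows,
        "cursorMark": cursor_mark,
    }
    if return_fields:  # left out when blank, so Solr falls back to its default field list
        params["fl"] = ",".join(return_fields)
    return params


def fetch_all(search, params):
    """Page through results; `search` is a hypothetical callable: params -> dict."""
    docs = []
    while True:
        response = search(params)
        docs.extend(response["docs"])
        next_mark = response["nextCursorMark"]
        if next_mark == params["cursorMark"]:
            break  # an unchanged cursor mark means there are no more results
        params = {**params, "cursorMark": next_mark}
    return docs
```

In a real client, `search` would wrap an HTTP call to the collection's select handler and unwrap the nested response; keeping it as a parameter lets the paging logic be exercised without a running Solr instance.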
Problem – Conversion Between Data Formats
• Specialized processors to operate on different data types
• Sometimes missing conversions
• Sometimes missing a specific function for a data type
• Sometimes implemented with different libraries causing inconsistencies
Solution – Record Processing
• Introduce the concept of a “record”
– Released in Apache NiFi 1.2.0 (May 2017), improvements in 1.3.0 and 1.4.0
• Centralize the logic for reading/writing records into controller services
– Readers/Writers for CSV, JSON, Avro, etc.
• Provide standard processors that operate on records
– ConvertRecord, QueryRecord, PartitionRecord, UpdateRecord, etc.
• Provide integration with schema registries
– Local Schema Registry, Hortonworks Schema Registry, Confluent Schema Registry
• Can still handle arbitrary data, but process records when appropriate
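The design idea is that format knowledge lives in interchangeable readers and writers, so a single conversion step works for any pairing. The sketch below illustrates that separation with Python's standard library; the class names and the `convert_record` function are assumptions that only borrow the naming of the NiFi components, and schemas are ignored entirely.

```python
import csv
import io
import json


class CsvReader:
    """Parse CSV text with a header row into a list of record dicts."""

    def read(self, text):
        return [dict(row) for row in csv.DictReader(io.StringIO(text))]


class JsonReader:
    """Parse a JSON array of objects into a list of record dicts."""

    def read(self, text):
        return json.loads(text)


class JsonWriter:
    """Serialize records as a JSON array."""

    def write(self, records):
        return json.dumps(records)


class CsvWriter:
    """Serialize records as CSV, taking the header from the first record."""

    def write(self, records):
        if not records:
            return ""
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=list(records[0].keys()),
                                lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return out.getvalue()


def convert_record(text, reader, writer):
    """Format-agnostic conversion: any reader can be paired with any writer."""
    return writer.write(reader.read(text))
```

Adding a new format then means adding one reader and one writer, rather than one processor per pair of formats.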
Problem – Variable Handling
• Need to parametrize values in the flow per environment
– Connection strings, URLs, File System paths, etc.
• Can set variables in bootstrap.conf
– -Dmy.var=foo
• Can set a properties file in nifi.properties
– nifi.variable.registry.properties=production.properties
• Both require command line access
• Both require restart to pick up changes
Solution – First Class Variable Registry
• Variables associated with a process group, released in 1.4.0
• Right-click on canvas to view variables for current group
• Hierarchical order of precedence: a reference resolves to the definition closest to the component
• Editing variables automatically restarts any components referencing the variables
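The hierarchical precedence rule can be sketched as a walk from a component's own group up through its parents, stopping at the first group that defines the variable. The `ProcessGroup` class below is a hypothetical model for illustration, not NiFi's class, and the variable names and URLs are made up.

```python
class ProcessGroup:
    """A process group with its own variables and an optional parent group."""

    def __init__(self, name, parent=None, variables=None):
        self.name = name
        self.parent = parent
        self.variables = dict(variables or {})

    def resolve(self, key):
        """Return the value from the closest group that defines `key`, else None."""
        group = self
        while group is not None:
            if key in group.variables:
                return group.variables[key]
            group = group.parent  # not defined here, so look one level up
        return None


# Usage: the child overrides solr.url but inherits env from the root group.
root = ProcessGroup("root", variables={"solr.url": "http://root:8983/solr", "env": "dev"})
ingest = ProcessGroup("ingest", parent=root, variables={"solr.url": "http://ingest:8983/solr"})
print(ingest.resolve("solr.url"))
print(ingest.resolve("env"))
```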
Problem – How do I deploy my flow?
• Most organizations want the classic development lifecycle (dev -> int -> prod)
• Can copy flow.xml.gz between environments
– Requires copying entire data flow
– Can’t tell what changed, hard to diff if you put it in version control
– Requires all environments to use the same encryption key for sensitive properties
• Can make templates for portions of the flow
– Script creation of template and deployment to next environment
– Requires stopping flow and removing components, then re-instantiating template
– No easy way to see changes, hard to roll back
Solution – NiFi Registry
• DISCLAIMER - UNDER DEVELOPMENT & NOT RELEASED YET!
• Complementary application, sub-project of Apache NiFi
– https://github.com/apache/nifi-registry
– https://issues.apache.org/jira/projects/NIFIREG
• Central location for storage/management of shared resources across NiFi instances
• Initial capability to store and retrieve “versioned flows”
• A versioned flow is a snapshot of a process group at a given point in time
• Potentially store extensions, shared data sets, and more in the future
Example Scenario
• User data
– https://randomuser.me
• Initially in CSV format
– name.title,name.first,name.last,email,registered
– mr,dennis,reyes,dennis.reyes@example.com,2012-04-10 01:54:19
– miss,carole,gomez,carole.gomez@example.com,2002-12-17 22:15:49
• Requirements
– Convert CSV to JSON
– Add a full_name field with first name + last name
– Add a gender field based on title (e.g. if title == mr then MALE)
– Ingest to different Solr collections depending on environment
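The first three requirements can be sketched outside NiFi to make the target output concrete. This is plain Python rather than a NiFi flow, and only the `mr` to `MALE` rule comes from the scenario; the other title mappings and the `UNKNOWN` fallback are assumptions. Ingesting into Solr is left out.

```python
import csv
import io
import json

# Only "mr" -> "MALE" is given in the scenario; the remaining entries are assumed.
TITLE_TO_GENDER = {
    "mr": "MALE",
    "miss": "FEMALE",
    "ms": "FEMALE",
    "mrs": "FEMALE",
}


def enrich(row):
    """Return a copy of the record with full_name and gender added."""
    record = dict(row)
    record["full_name"] = f"{row['name.first']} {row['name.last']}"
    record["gender"] = TITLE_TO_GENDER.get(row["name.title"], "UNKNOWN")
    return record


def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of enriched records."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return json.dumps([enrich(row) for row in rows], indent=2)


SAMPLE = (
    "name.title,name.first,name.last,email,registered\n"
    "mr,dennis,reyes,dennis.reyes@example.com,2012-04-10 01:54:19\n"
    "miss,carole,gomez,carole.gomez@example.com,2002-12-17 22:15:49\n"
)

if __name__ == "__main__":
    print(csv_to_json(SAMPLE))
```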