1. Discover HDP 2.2:
Apache Falcon for Hadoop Data Governance
Page 1 © Hortonworks Inc. 2014
Hortonworks. We do Hadoop.
2. Speakers
Justin Sears
Hortonworks Product Marketing Manager
Andrew Ahn
Hortonworks Director of Product Management for Data Governance in Hortonworks Data Platform
Venkatesh Seetharam
Foundational Hadoop Architect, Committer and PMC Member for Apache Falcon
3. Agenda
• Introduction to Apache Falcon
• New Innovation in Apache Falcon 0.6.0
– HDFS Mirroring
– Cloud Replication
• A Look Ahead
• Q & A
We’ll move quickly:
• Attendee phone lines are muted
• Text any questions to Andrew Ahn using Webex chat
• Questions answered at the end
• Unanswered questions and answers in upcoming blog post
4. Big Data, Hadoop & Data Center Re-platforming
Business Drivers
• From reactive analytics to proactive interactions
• Insights that drive competitive advantage & optimal returns
Financial Drivers
• Cost of data systems, as % of IT spend, continues to grow
• Cost advantages of commodity hardware & open source software
Technical Drivers
• Data is growing exponentially & existing systems overwhelmed
• Predominantly driven by NEW types of data that can inform analytics
There is an inequitable balance between vendor and customer in the market
5. Hadoop Value: New Types of Data
• Clickstream: Capture and analyze website visitors’ data trails and optimize your website
• Sensors: Discover patterns in data streaming automatically from remote sensors and machines
• Server Logs: Research logs to diagnose process failures and prevent security breaches
• Sentiment: Understand how your customers feel about your brand and products – right now
• Geographic: Analyze location-based data to manage operations where they occur
• Unstructured: Understand patterns in files across millions of web pages, emails, and documents
6. A Shift from Reactive to Proactive Interactions
HDP and Hadoop allow organizations to use data to shift interactions from Reactive (Post Transaction) to Proactive (Pre Decision)
• A shift in Advertising: From mass branding …to 1x1 Targeting
• A shift in Financial Services: From Educated Investing …to Automated Algorithms
• A shift in Healthcare: From mass treatment …to Designer Medicine
• A shift in Retail: From static branding …to Real-time Personalization
• A shift in Telco: From break then fix …to repair before break
7. Enterprise Goals for the Modern Data Architecture
• Consolidate siloed data sets, structured and unstructured
• Central data set on a single cluster
• Multiple workloads across batch, interactive and real time
• Central services for security, governance and operation
• Preserve existing investment in current tools and platforms
• Single view of the customer, product, supply chain
[Diagram: SOURCES (EXISTING Systems such as CRM, ERP and Other; Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured) feed the DATA SYSTEM (RDBMS, EDW, MPP, and YARN: Data Operating System on HDFS, running Batch, Interactive and Real-Time workloads), which serves APPLICATIONS (Business Analytics, Custom Applications, Packaged Applications)]
8. YARN Transformed Hadoop & Opened a New Era
YARN: The Architectural Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
[Diagram: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System). Engines shown: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines), with Tez and Slider as intermediate layers]
9. YARN Extends Hadoop to Other Data Center Leaders
YARN: The Architectural Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
• Supports 3rd-party ISV tools (ex. SAS, Syncsort, Actian, etc.)
YARN Ready Applications
• Facilitates ongoing innovation and enterprise adoption via ecosystem of new and existing “YARN Ready” solutions
[Diagram: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, ISV Engines, with Tez and Slider) on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System)]
10. Enterprise Hadoop: Central Set of Services
Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for:
• Governance
• Operations
• Security
Everything that plugs into Hadoop inherits these services
• GOVERNANCE: Load data and manage according to policy
• OPERATIONS: Deploy and effectively manage the platform (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie)
• SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
[Diagram: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV Engines; with Tez and Slider) on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System), flanked by GOVERNANCE, SECURITY and OPERATIONS]
11. Hortonworks Development Investment for the Enterprise
Vertical Integration with YARN and HDFS
[Diagram: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, ISV Engines, with Tez and Slider) on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System), flanked by GOVERNANCE (Load data and manage according to policy), SECURITY (Authentication, Authorization, Accounting, and Data Protection) and OPERATIONS (Ambari, Zookeeper, Oozie)]
• Ensure engines can run reliably and respectfully in a YARN based cluster
• Implement features throughout the stack to accommodate
12. Hortonworks Development Investment for the Enterprise
Horizontal Integration for Enterprise Services
[Diagram: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, ISV Engines, with Tez and Slider) on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System), flanked by GOVERNANCE (Load data and manage according to policy), SECURITY (Authentication, Authorization, Accounting, and Data Protection) and OPERATIONS (Ambari, Zookeeper, Oozie)]
• Ensure consistent enterprise services are applied across the entire Hadoop stack
• Integrate with and extend existing data center solutions for these key requirements
13. HDP Delivers Enterprise Hadoop
Hortonworks Data Platform 2.2
YARN is the architectural center of HDP
• Common data set across all applications
• Batch, interactive & real-time workloads
• Multi-tenant access & processing
Provides comprehensive enterprise capabilities
• Governance
• Security
• Operations
Enables broad ecosystem adoption
• ISVs can plug directly into Hadoop
The widest range of deployment options
• Linux & Windows
• On premises & cloud
[Diagram: the HDP 2.2 stack.
GOVERNANCE: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS).
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines), with Tez and Slider, on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System).
SECURITY: Authentication, Authorization, Audit, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Ranger; Cluster: Knox).
OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie).
Deployment Choice: Linux, Windows, On-Premises, Cloud]
14. HDP Delivers Enterprise Hadoop
Hortonworks Data Platform 2.2
YARN is the architectural center of HDP
• Common data set across all applications
• Batch, interactive & real-time workloads
• Multi-tenant access & processing
Provides comprehensive enterprise capabilities
• Governance
• Security
• Operations
Enables broad ecosystem adoption
• ISVs can plug directly into Hadoop
The widest range of deployment options
• Linux & Windows
• On premises & cloud
[Diagram: the HDP 2.2 stack (DATA ACCESS engines on YARN and HDFS, SECURITY, OPERATIONS, Deployment Choice: Linux, Windows, On-Premises, Cloud) with the GOVERNANCE block called out: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS)]
16. Falcon Overview
Centrally Manage Data Lifecycle
– Centralized definition & management of pipelines for data ingest, process & export
Business Continuity & Disaster Recovery
– Out of the box policies for data replication & retention
– End to end monitoring of data pipelines
Address audit & compliance requirements
– Visualize data pipeline lineage
– Track data pipeline audit logs
– Tag data with business metadata
The data traffic cop
17. Falcon Architecture
Centralized Falcon Orchestration Framework
[Diagram: Data stewards + Hadoop admins reach the Falcon Server through API & UI and AMBARI. The Falcon Server passes Entity Specs to Oozie as Scheduled Jobs; Oozie runs Hadoop ecosystem tools (MapRed / Pig / Hive / Sqoop / Flume / DistCP) against HDFS / Hive; Process Status returns over JMS]
18. Data Pipeline: Definition
• XML based pipeline specification
– Modular - Clusters, feeds & processes defined separately and then linked together
– Easy to re-use across multiple pipelines
• Out of the box policies
– Predefined policies for replication, late data handling & eviction
– Easy customization of policies
• Extensible
– Plug in external solutions at any step of the pipeline
– E.g., invoke third-party data obfuscation components
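The modular idea on this slide (clusters, feeds and processes defined separately, then linked by name) can be sketched in a few lines of Python. This is an illustration only, not Falcon's XML schema or API: the class names, field names and the `unresolved_links` check are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Cluster:
    name: str


@dataclass(frozen=True)
class Feed:
    name: str
    clusters: tuple  # names of the clusters that hold this feed


@dataclass(frozen=True)
class Process:
    name: str
    inputs: tuple   # names of input feeds
    outputs: tuple  # names of output feeds


@dataclass
class Registry:
    """Hypothetical store for separately defined entities, linked by name."""
    clusters: dict = field(default_factory=dict)
    feeds: dict = field(default_factory=dict)
    processes: dict = field(default_factory=dict)

    def add(self, entity):
        # File each entity under its own kind so it can be reused by name.
        if isinstance(entity, Cluster):
            self.clusters[entity.name] = entity
        elif isinstance(entity, Feed):
            self.feeds[entity.name] = entity
        elif isinstance(entity, Process):
            self.processes[entity.name] = entity
        else:
            raise TypeError(f"unsupported entity: {entity!r}")

    def unresolved_links(self):
        """Return sorted (entity, missing_name) pairs for dangling references."""
        missing = []
        for feed in self.feeds.values():
            missing += [(feed.name, c) for c in feed.clusters
                        if c not in self.clusters]
        for proc in self.processes.values():
            missing += [(proc.name, f) for f in proc.inputs + proc.outputs
                        if f not in self.feeds]
        return sorted(missing)
```

Because each entity only names the others it depends on, the same cluster or feed definition can be shared by several pipelines, which is the re-use property the slide describes.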
19. Data Pipeline: Monitoring
Centralized monitoring of data pipeline with Falcon + Ambari
• Pipeline run alerts
• Pipeline run history
• Pipeline scheduling
[Diagram: DATA moves through raw, clean and prep stages on Hadoop Cluster-1 (Primary site) and on Hadoop Cluster-2 (DR site)]
20. Data Pipeline: Tracing
Data pipeline dependencies
• View dependencies between clusters, datasets and processes
[Diagram: Store feed, Customer feed, Purchase feed, Product feed]
Data pipeline tagging
• Add arbitrary tags to feeds & processes
[Diagram: Credit feed tagged Sensitive and Encrypted]
Data pipeline audits
• Know who modified a dataset when and into what
Coming Soon
Data pipeline lineage
• Analyze how a dataset reached a particular state
[Diagram: File-1, File-2, File-3]
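Dependency tracing of this kind amounts to walking a graph of feeds and processes. The sketch below assumes a made-up edge list (the feed and process names are hypothetical and are not taken from any Falcon API); it only shows the traversal idea.

```python
from collections import deque


def downstream(edges, start):
    """Return every node reachable from `start`, in breadth-first order."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Hypothetical pipeline: feeds flow into processes, which emit new feeds.
EDGES = [
    ("customer-feed", "join-process"),
    ("purchase-feed", "join-process"),
    ("join-process", "store-feed"),
    ("store-feed", "report-process"),
]
```

Calling `downstream(EDGES, "customer-feed")` lists everything that would be affected by a change to that feed; reversing the edges would answer the lineage question of how a dataset reached its current state.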
21. Replication with Falcon
[Diagram: the Primary Hadoop Cluster holds Staged Data, Cleansed Data, Conformed Data and Presented Data. Replication copies Staged Data and Presented Data to the Failover Hadoop Cluster. BI / Analytics (BusinessObjects BI) reads the presented data]
• Falcon manages workflow and replication
• Enables business continuity without requiring full data reprocessing
• Failover clusters can be smaller than primary clusters
22. Data Retention with Falcon
[Diagram: a Retention Policy is attached to each stage (Staged Data, Cleansed Data, Conformed Data, Presented Data). Labels shown: Retain 5 Years, Retain Last Copy Only, Retain 3 Years, Retain 3 Years]
• Sophisticated retention policies expressed in one place
• Simplify data retention for audit, compliance, or for data re-processing
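As a rough illustration of the two kinds of retention rule shown on the slide (a time window, and keeping only the last copy), here is a minimal Python sketch. The function names and the way instances are represented as timestamps are assumptions made for the example, not Falcon's implementation.

```python
from datetime import datetime, timedelta


def instances_to_evict(instances, retain, now):
    """Return instance timestamps older than the retention window, oldest first."""
    cutoff = now - retain
    return sorted(ts for ts in instances if ts < cutoff)


def keep_last_copy_only(instances):
    """Return every instance except the newest one (the ones to evict)."""
    return sorted(instances)[:-1]
```

Keeping the rule as data (a `timedelta` per dataset) rather than as code inside each job is what lets the policies be expressed in one place.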
23. Late Data Handling with Falcon
[Diagram: Online Transaction Data (via Sqoop) and Web Log Data (via FTP) land in Staged Data and are merged into Combined Data. The pipeline waits up to 4 hours for FTP data to arrive]
• Processing waits until all required input data is available
• Checks for late data arrivals and retriggers processing as necessary
• Eliminates writing complex data handling rules within applications
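The waiting and retrigger behaviour can be pictured with a small decision function. This is a sketch under stated assumptions: the function names, the `"run"`/`"wait"`/`"timeout"` outcomes and the dictionary of input readiness are hypothetical, and only the 4-hour limit comes from the slide.

```python
from datetime import datetime, timedelta


def decide(inputs_ready, window_start, now, max_wait=timedelta(hours=4)):
    """Return 'run', 'wait', or 'timeout' for one pipeline instance."""
    if all(inputs_ready.values()):
        return "run"          # every required input has arrived
    if now - window_start < max_wait:
        return "wait"         # still inside the late-arrival window
    return "timeout"          # window expired with inputs missing


def needs_rerun(processed_at, arrivals):
    """True if any input arrived after the instance was already processed."""
    return any(ts > processed_at for ts in arrivals)
```

With a rule like this held by the scheduler, individual applications do not need their own logic for missing or late inputs.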
24. Falcon Investment Plans
DATES AND FEATURES SUBJECT TO CHANGE
November 2014
• Authentication & Authorization Integration
• Pipeline, (HDFS file & Hive) table Lineage GA
• HDFS DR Replication with Recipes
• UI for Lineage management
• Replicate to Cloud - Azure & S3
Post-HDP 2.2 Tech Preview
• Hive/HCat metastore Replication
• Expanded UI Entity creation and management
Future Release
• Hive/HCat metastore Replication GA
• Pipeline Run Notification via SNMP, e-mail, etc.
• Hive ACID support
• HDFS Snapshot Integration
• File import SSH & SCP
• Visual Pipeline Designer
• Resource Metrics
• Automated migration of data through HDFS storage tiers
25. New in Apache Falcon 0.6.0:
HDFS Mirroring
26. DR Mirroring of HDFS with Recipes
• Mirroring for Disaster Recovery and Business continuity use cases.
• Customizable for multiple targets and frequency of synchronization
• Recipes: Template model for re-use of complex workflows
[Diagram: three Recipes, each pairing a Properties file with a Workflow Template made of Reduce, Cleanse and Replicate steps]
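The recipe idea (one workflow template, many property sets) can be sketched with Python's standard `string.Template`. The template text, property names and paths below are invented for illustration and do not reflect Falcon's actual recipe file format.

```python
from string import Template

# Hypothetical workflow template; the $placeholders are filled per recipe.
WORKFLOW_TEMPLATE = Template(
    "replicate $source_dir -> $target_dir every $frequency"
)


def render_recipe(template, properties):
    """Fill a workflow template; raises KeyError if a property is missing."""
    return template.substitute(properties)


# Two hypothetical property sets re-using the same template.
DR_SITE = {"source_dir": "/data/raw", "target_dir": "/backup/raw", "frequency": "1h"}
ARCHIVE = {"source_dir": "/data/clean", "target_dir": "/archive/clean", "frequency": "24h"}
```

Each property set yields a different concrete workflow from the same template, which matches the slide's point about customizing targets and synchronization frequency without rewriting the workflow.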
27. New in Apache Falcon 0.6.0:
Cloud Replication
28. Replication to Cloud
• Seamlessly replicate to Cloud targets
• Replicate from Cloud as a source.
• Support for Amazon S3 and Microsoft Azure
[Diagram: On Prem Cluster replicating to and from Azure and Amazon S3]
36. Q & A
37. Thank you!
Learn more at:
hortonworks.com/hadoop/falcon/
Register for the remaining 5 Discover HDP 2.2 Webinars
Hortonworks.com/webinars