© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Implementing a Data Lake with Enterprise
Grade Data Governance
We do Hadoop.
Your speakers
Andrew Ahn
Governance Product Manager, Hortonworks
Oliver Claude
CMO at Waterline
HDP: Data Governance
We Do Hadoop
Enterprise Data Governance Goals
GOAL: Provide a common approach to
data governance across all systems
and data within the organization
•  Transparent
Governance standards & protocols must be
clearly defined and available to all
•  Reproducible
Recreate the relevant data landscape at a
point in time
•  Auditable
All relevant events and assets must be
traceable with appropriate historical lineage
•  Consistent
Compliance practices must be consistent
[Diagram: a Governance Framework spanning ETL/DQ, BPM, Business Analytics, Visualization & Dashboards, ERP, CRM, SCM, MDM, and Archive]
Data Governance Challenges WITHIN Hadoop
•  No comprehensive governance within
the Hadoop stack
•  Mostly disjoint as each project defines its own
future and there is no common framework
•  Disparate tools, such as HCatalog, Ranger and
Falcon provide pieces of the overall solution
•  No integration with external governance
frameworks
•  Difficult to get right because each project
is autonomous and you need insight into
traditional IT
[Diagram: Apache Pig, Apache Hive, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm]
Data Governance Initiative for Hadoop
[Diagram: Data Governance Initiative, Common Governance Framework. Enterprise systems: ETL/DQ, BPM, Business Analytics, Visualization & Dashboards, ERP, CRM, SCM, MDM, Archive. Hadoop projects: Apache Pig, Apache Hive, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm]
TWO Requirements
1.  Hadoop must snap in to
the existing frameworks
and be a good citizen
2.  Hadoop must also provide
governance within its own
stack of technologies
A group of companies dedicated to meeting
these requirements in the open
Common Data Governance Use Cases
Financial Reporting
Chain of custody, Lineage Narratives
Telco
Device log management, Correlation, Analysis, and Mitigation
Retail
Point of sale analysis, Price optimization
Healthcare
30 day measures reporting
Apache Atlas Overview
We Do Hadoop
New Project Proposal: Apache Atlas
Apache Atlas
Proposed open source project
aimed at solving the Hadoop
data governance challenge in
the open.
Key Capabilities
•  Data Classification
•  Metadata Exchange
•  Centralized Auditing
•  Search & Lineage (Browse)
•  Security & Policy Engine
[Diagram: Apache Atlas architecture. Components: Knowledge Store; Audit Store; Models; Type-System; Policy Rules; Taxonomies; Tag Based Policies; Data Lifecycle Management; Real Time Tag Based Access Control; REST API; Services: Search, Lineage, Exchange. Domains: Healthcare (HIPAA, HL7); Financial (SOX, Dodd-Frank); Energy (PPDM); Retail (PCI, PII); Other (CWM)]
Essential Timeline
Phase-3
•  Collaboration Features
•  Self Service
•  Steward Delegation
•  Profiling & Pattern Analysis
•  Visualization
Phase-2
•  Advanced audit reporting
•  Advanced Policy Engine
•  Row / Column Masking
•  3rd party Metadata exchange
1H 2015 GA
•  REST API
•  Centralized Taxonomy
•  Import / export metadata
•  Basic Policy Rules Engine
•  Real-time access control
•  Column Level Tagging
Apache Atlas Capabilities: Overview
Data Classification
•  Import or define taxonomy business-oriented annotations for data
•  Define, annotate, and automate capture of relationships between data sets and underlying
elements including source, target, and derivation processes
•  Export metadata to third-party systems
Centralized Auditing
•  Capture security access information for every application, process, and interaction with data
•  Capture the operational information for execution, steps, and activities
Search & Lineage (Browse)
•  Pre-defined navigation paths to explore the data classification and audit information
•  Text-based search features locate relevant data and audit events across the Data Lake quickly
and accurately
•  Browse visualization of data set lineage allowing users to drill-down into operational, security,
and provenance related information
Security & Policy Engine
•  Rationalize compliance policy at runtime based on data classification schemes
•  Advanced definition of policies for preventing data derivation based on classification (i.e., re-identification)
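The lineage capability described above (capturing relationships between data sets, including source, target, and derivation process) can be pictured as a small graph. The following is a minimal sketch under stated assumptions, not Atlas code: the `LineageGraph` class, its methods, and the data set names are all hypothetical and exist only to illustrate the idea of browsing lineage upstream.

```python
from collections import defaultdict


class LineageGraph:
    """Toy lineage store (hypothetical): edges run from a source data set to a derived target."""

    def __init__(self):
        # target -> set of (source, process) pairs that produced it
        self._parents = defaultdict(set)

    def record(self, source, target, process):
        """Capture one derivation: `process` read `source` and wrote `target`."""
        self._parents[target].add((source, process))

    def upstream(self, dataset):
        """Return every data set that `dataset` was directly or indirectly derived from."""
        seen, stack = set(), [dataset]
        while stack:
            current = stack.pop()
            for source, _process in self._parents.get(current, ()):
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen


# Hypothetical data sets and process names, for illustration only.
graph = LineageGraph()
graph.record("raw.claims", "staging.claims", "etl_clean_claims")
graph.record("staging.claims", "mart.claims_30day", "hive_aggregate")
graph.record("ref.providers", "mart.claims_30day", "hive_aggregate")

print(sorted(graph.upstream("mart.claims_30day")))
```

A classification placed on `raw.claims` could then be checked against everything derived from it, which is the kind of question a derivation-prevention policy would need to answer.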
Apache Atlas Overview
Knowledge Store
Knowledge store categorized with appropriate business-oriented taxonomy
•  Data sets & objects
•  Tables / Columns
•  Logical context
•  Source, destination
Support exchange of metadata between foundation
components and third-party applications/governance tools
Leverages existing Hadoop metastores
Apache Atlas Overview
Data Lifecycle Management
Leverage existing investment in Apache Falcon with a
focus on:
•  Provenance
•  Multi-cluster replication
•  Data set retention/eviction
•  Late data handling
•  Automation
Apache Atlas Overview
Audit Store
Historical repository for all governance events
•  Security: Access Grant & Deny
•  Operational: Data Provenance & Metrics
•  Indexed and Searchable
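As an illustration of a historical, indexed, and searchable repository of governance events, here is a minimal sketch. It assumes an in-memory, append-only list with a per-asset index; the `AuditEvent` fields and the `AuditStore` class are hypothetical and do not reflect how Atlas stores audit data.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    """One governance event (hypothetical schema)."""
    timestamp: datetime
    category: str   # "security" or "operational"
    actor: str
    asset: str
    outcome: str    # e.g. "grant", "deny", "success"


class AuditStore:
    """Append-only event history with a simple index on the asset name."""

    def __init__(self):
        self._events = []     # events are only ever appended, never edited
        self._by_asset = {}   # asset -> positions of its events in _events

    def append(self, event):
        self._events.append(event)
        self._by_asset.setdefault(event.asset, []).append(len(self._events) - 1)

    def search(self, asset, category=None, outcome=None):
        """Return events for `asset`, optionally filtered by category and outcome."""
        hits = (self._events[i] for i in self._by_asset.get(asset, []))
        return [e for e in hits
                if (category is None or e.category == category)
                and (outcome is None or e.outcome == outcome)]


store = AuditStore()
now = datetime(2015, 1, 15, 9, 0, tzinfo=timezone.utc)
store.append(AuditEvent(now, "security", "alice", "hive.sales.customers", "grant"))
store.append(AuditEvent(now, "security", "bob", "hive.sales.customers", "deny"))
store.append(AuditEvent(now, "operational", "etl_job_7", "hive.sales.orders", "success"))

denied = store.search("hive.sales.customers", category="security", outcome="deny")
print([e.actor for e in denied])
```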
Apache Atlas Overview
Security
Integration with HDP Advanced Security investments
to ensure compliance.
Establish global security policies based on data
classification.
Leverages Ranger plug-in architecture for policy
enforcement
Apache Atlas Overview
Policy Engine
Runtime rationalization of policy rules with respect to
data asset combinations and time. Fully extensible.
•  Metadata based
•  Geo based rules
•  Time-based rules
•  Hive Column Prohibitions
•  Preview: Hive Row and Column Masking
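The rule types listed above (metadata-based, geo-based, time-based, and column prohibitions) can be sketched as small predicates evaluated together at request time. This is only an illustration under assumptions: the `Request` fields, the rule functions, the tag names, and the business-hours window are hypothetical, and the real policy engine may combine rules quite differently.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Request:
    """Hypothetical access request against a set of tagged columns."""
    user_country: str
    local_time: time
    column_tags: frozenset


def deny_pii_outside_country(req, home_country="US"):
    """Geo-based rule: PII-tagged columns may only be read from the home country."""
    return "PII" in req.column_tags and req.user_country != home_country


def deny_outside_business_hours(req, start=time(8, 0), end=time(18, 0)):
    """Time-based rule: access is allowed only between `start` and `end`."""
    return not (start <= req.local_time <= end)


def deny_prohibited_tags(req, prohibited=frozenset({"PCI"})):
    """Column prohibition: any column carrying a prohibited tag is blocked."""
    return bool(req.column_tags & prohibited)


RULES = [deny_pii_outside_country, deny_outside_business_hours, deny_prohibited_tags]


def evaluate(req, rules=RULES):
    """Return (allowed, names of the rules that denied the request)."""
    denied_by = [rule.__name__ for rule in rules if rule(req)]
    return (not denied_by, denied_by)


ok = Request("US", time(10, 30), frozenset({"PII"}))
bad = Request("DE", time(23, 0), frozenset({"PII", "PCI"}))
print(evaluate(ok))
print(evaluate(bad))
```

Keeping each rule as an independent function is one way to read "fully extensible": a new rule type is added to the list without touching the evaluator.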
Apache Atlas Overview
RESTful interface
•  Extensible enterprise classification of data assets,
relationships and policies organized in a meaningful
way -- aligned to business organization.
•  Supports exploration via user interface
•  Supports extensibility via API and CLI exposure
Coming 2H 2015
Apache Atlas Overview
Enhanced Audit Store
Historical repository for all governance events
•  Immutable file format
•  Events Metadata Taggable
•  Advanced Reporting
•  Security: Access Grant & Deny
•  Operational: Data Provenance & Metrics
•  Indexed and Searchable
Summary
Apache Atlas Capabilities: Overview
Data Classification
•  Import or define taxonomy business-oriented annotations for data
•  Define, annotate, and automate capture of relationships between data sets and underlying
elements including source, target, and derivation processes
•  Export metadata to third-party systems
Centralized Auditing
•  Capture security access information for every application, process, and interaction with data
•  Capture the operational information for execution, steps, and activities
Search & Lineage (Browse)
•  Pre-defined navigation paths to explore the data classification and audit information
•  Text-based search features locate relevant data and audit events across the Data Lake quickly
and accurately
•  Browse visualization of data set lineage allowing users to drill-down into operational, security,
and provenance related information
Security & Policy Engine
•  Rationalize compliance policy at runtime based on data classification schemes
•  Advanced definition of policies for preventing data derivation based on classification (i.e., re-identification)
Governance Ready Certification Program
Curated group of vendor partners to provide
rich & complete features
Customers choose features that they want to
deploy – a la carte.
Low switching costs!
HDP at core to provide stability and
interoperability
[Diagram: partner categories: Discovery, Tagging, Prep / Cleanse, ETL, Governance, BPM, Self Service, Visualization]
Waterline Data improves speed to value and
compliance
[Diagram: Data Warehouse Offload, Data Science/Analytics Sandbox, Data Lake; labels: Value Creation, Cost Savings; goals: Deliver a Business-Ready Data Lake, Accelerate Data Prep Process, Govern Data in Hadoop]
Find, understand and govern data in Hadoop
The Modern Data Architecture
Apache Atlas Capabilities: Overview
REST API
Business Glossary
Automated Classification (Tagging)
Automated Lineage Discovery
Profiling and Data Quality
Schema Discovery
Change Detection and Audit
•  Glossary
•  Tags
•  Lineage
•  Models
Imagine shopping on Amazon.com
GOVERNANCE
Inventory
Find and Understand
Provision
Waterline Data is like Amazon.com for data in
Hadoop
GOVERNANCE
Inventory
Find and Understand
Provision
Inventory
Find and Understand
Provision
Governance
Find, understand and govern data in Hadoop
Big Data IT Architect
Deliver a Business-
Ready Data Lake
Data Engineer/Data Scientist
Accelerate Data Prep
Process
CDO/Data Steward
Govern Data in
Hadoop
Deliver a business-ready data lake
“It’s easy to get data into Hadoop, but it’s not necessarily easy to get data out of Hadoop. There is a need for data as a
service to help the business find, understand, and govern data in Hadoop.”
Joe DosSantos, EMC Big Data Practice Leader
Accelerate data prep process
“80% of Big Data analytics is data prep, and 80% of data prep is inventorying data.”
Data Engineering Director, Financial Services
Accelerate data prep process
“Waterline Data fills a critical gap in big data exploratory analytics by automating the tagging and cataloging of data,
which in turn can help analytic teams provision the right data for their analyses.”
Tony Baer, Principal Analyst, Ovum
Govern data in Hadoop
“Data lakes therefore carry substantial risks. The most important is the inability to determine data quality or the lineage of findings by
other analysts or users that have found value, previously, in using the same data in the lake. By its definition, a data lake accepts any
data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a
data swamp. And without metadata, every subsequent use of data means analysts start from scratch.”
“Gartner Says Beware of the Data Lake Fallacy” post on the Gartner website
Govern data in Hadoop
“The first step to governing Big Data is to build an inventory.”
Sunil Soares, Managing Partner, Information Asset
Best practice approach to implement an enterprise-grade data lake
1. Create and populate landing area
2. Build inventory of data
3. Integrate with enterprise metadata repository
4. Protect sensitive data
5. Open up to users
6. Monitor and maintain
Best practices in deployment landscape
1. Create and populate landing area
•  Create landing directory structure
•  Set up ETL processes using Falcon to orchestrate
•  Implement ETL jobs using ETL tools (Syncsort, Talend, Informatica, etc.), Hadoop tools (Sqoop, Flume, etc.), or FTP
[Diagram components: Falcon]
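The landing directory structure from step 1 is not specified in the slides. As one possible convention, the sketch below builds date-partitioned landing paths per source system and data set. The root `/data/landing`, the `year=/month=/day=` layout, and the function name are assumptions made for illustration, not a layout prescribed by Hortonworks or Waterline Data.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical root of the landing area on the cluster file system.
LANDING_ROOT = PurePosixPath("/data/landing")


def landing_path(source_system, dataset, load_date):
    """Build the landing directory for one data set and one load date."""
    return (LANDING_ROOT / source_system / dataset
            / f"year={load_date:%Y}"
            / f"month={load_date:%m}"
            / f"day={load_date:%d}")


print(landing_path("crm", "accounts", date(2015, 3, 9)))
# prints /data/landing/crm/accounts/year=2015/month=03/day=09
```

A predictable layout like this makes it easier for ingestion jobs and later inventory crawls to tell which source and load date a file belongs to.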
Best practices in deployment landscape
2. Build inventory of data
•  Crawl the cluster
•  Profile files
•  Automatically discover technical, business, and compliance metadata at a field level
•  Create Hive tables as needed
•  Import lineage
•  Export to Atlas
[Diagram components: Falcon, HCatalog, Atlas]
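To make field-level profiling in step 2 concrete, here is a minimal sketch that suggests a tag for each column of a delimited file when most of its values match a known pattern. It is not Waterline Data's algorithm: the pattern set, the 0.8 threshold, the tag names, and the sample data are all assumptions chosen for illustration.

```python
import csv
import io
import re

# Hypothetical patterns; a real profiler would use far richer evidence.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "integer": re.compile(r"^-?\d+$"),
}


def profile_fields(csv_text, threshold=0.8):
    """Suggest a tag per column when at least `threshold` of its non-empty values match."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    suggestions = {}
    if not rows:
        return suggestions
    for field in rows[0].keys():
        values = [row[field] for row in rows if row[field]]
        if not values:
            continue
        for tag, pattern in PATTERNS.items():
            matched = sum(1 for value in values if pattern.match(value))
            if matched / len(values) >= threshold:
                suggestions[field] = tag
                break
    return suggestions


SAMPLE = """id,contact,ssn,note
1,ann@example.com,123-45-6789,ok
2,bob@example.com,987-65-4321,
3,carol@example.com,555-12-3456,call back
"""

print(profile_fields(SAMPLE))
```

Suggested tags such as `us_ssn` are the kind of compliance metadata that later steps can use to decide what needs protection.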
Best practices in deployment landscape
3. Integrate with enterprise metadata repository
•  Import business glossary terms and export new tags and updated definitions
•  Synchronize Atlas and Waterline Data Inventory
•  Export metadata and lineage from Hadoop to Enterprise repository
[Diagram components: Falcon, HCatalog, Atlas]
Best practices in deployment landscape
4. Protect sensitive data
•  Use Waterline Data Inventory to find sensitive data
•  Create access privileges in Ranger
•  Encrypt or de-identify
[Diagram components: HCatalog, Ranger, Falcon, Atlas]
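The slides do not say how to de-identify sensitive fields in step 4. One common technique, swapped in here purely as an example, is keyed hashing (HMAC-SHA256), which replaces a value with a stable pseudonym so records can still be joined. The function names, field names, and key below are hypothetical; key management is out of scope for this sketch.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace a value with a keyed SHA-256 digest (same input and key give the same output)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record, sensitive_fields, secret_key):
    """Return a copy of `record` with the sensitive fields pseudonymized."""
    return {name: pseudonymize(value, secret_key) if name in sensitive_fields else value
            for name, value in record.items()}


# Hypothetical key and record, for illustration only.
KEY = b"replace-with-a-managed-secret"
record = {"id": "42", "ssn": "123-45-6789", "state": "CA"}
masked = deidentify(record, {"ssn"}, KEY)
print(masked["id"], masked["state"], len(masked["ssn"]))
```

Keyed hashing is not encryption: the original value cannot be recovered from the digest, so it fits cases where analysts need consistency across data sets but not the raw value.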
Best practices in deployment landscape
5. Open up to users
•  Create account with Kerberos, LDAP, etc.
•  Set up ACLs (leverage Ranger)
•  Users can browse securely through Waterline Data Inventory
[Diagram components: HCatalog, Ranger, Falcon, Atlas]
Best practices in deployment landscape
6. Monitor and maintain
•  Continue profiling new or changed files and sync with Atlas
•  Continue monitoring for sensitive data, use Ranger to protect
•  Build a folksonomy and synchronize with business glossary in Atlas and Enterprise Business Glossary
[Diagram components: HCatalog, Ranger, Falcon, Atlas]
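Profiling only new or changed files in step 6 requires some way to detect change. The sketch below compares content fingerprints between two crawls. It is an assumption-laden illustration: the function names and the in-memory dictionaries are hypothetical, and a real deployment would more likely rely on file system metadata or the tooling's own change tracking.

```python
import hashlib


def fingerprint(content):
    """Return a SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(content).hexdigest()


def detect_changes(previous, current_files):
    """Compare the last crawl's fingerprints with the files seen now.

    previous: dict of path -> fingerprint from the last crawl
    current_files: dict of path -> file content (bytes) seen in this crawl
    """
    current = {path: fingerprint(data) for path, data in current_files.items()}
    return {
        "new": sorted(p for p in current if p not in previous),
        "changed": sorted(p for p in current if p in previous and previous[p] != current[p]),
        "removed": sorted(p for p in previous if p not in current),
        "fingerprints": current,   # store this for the next crawl
    }


# Hypothetical paths and contents, for illustration only.
last_crawl = {"/landing/a.csv": fingerprint(b"1,2,3\n"),
              "/landing/b.csv": fingerprint(b"x,y\n")}
this_crawl = {"/landing/a.csv": b"1,2,3\n",
              "/landing/b.csv": b"x,y,z\n",
              "/landing/c.csv": b"new\n"}

result = detect_changes(last_crawl, this_crawl)
print(result["new"], result["changed"], result["removed"])
```

Only the paths reported as new or changed would need to be profiled again and synchronized.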
Find, understand and govern data in Hadoop
Discover lineage and
business metadata
automatically, and
manage metadata
CDO/Data Steward
Automate cataloging of
data assets at scale,
with secure
provisioning to
business users
Big Data Architect
Find and understand
best-suited and most
trusted data without
having to explore
every file manually
Data Engineer/Data
Scientist/Business Analyst
Questions and Answers
Next Steps…
Download the Hortonworks Sandbox
Learn Hadoop
Build Your Analytic App
Try Hadoop 2
More about Waterline Data & Hortonworks
http://hortonworks.com/partner/waterline-data
Joint tutorial: bit.ly/DataLakeTutorial
Modern Data Architecture Paper: go.waterlinedata.com/hw-mda
SAN JOSE
June 9-11
BRUSSELS
April 15-16
•  Deep-dive technical content
•  65+ sessions and 5 tracks
•  1,000 attendees
•  Sponsorships Available
•  Including Pre and Post event community meetups
and BOFs
•  Hadoop training available
•  100+ sessions and 7 tracks
•  Deep-dive technical content
•  5,000 attendees
•  Sponsorships Available
•  Including Pre and Post event community meetups
and BOFs
•  Hadoop training available
www.hadoopsummit.org
The Largest Hadoop Community Events in Europe and North America

Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Recently uploaded

CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
anilsa9823
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 

Recently uploaded (20)

CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 

Implementing a Data Lake with Enterprise Grade Data Governance

  • 1. © Hortonworks Inc. 2011 – 2014. All Rights Reserved Implementing a Data Lake with Enterprise Grade Data Governance We do Hadoop.
  • 2. Your speakers: Andrew Ahn, Governance Product Manager, Hortonworks; Oliver Claude, CMO at Waterline
  • 3. HDP: Data Governance. We Do Hadoop
  • 4. Enterprise Data Governance Goals. GOAL: Provide a common approach to data governance across all systems and data within the organization • Transparent: Governance standards & protocols must be clearly defined and available to all • Reproducible: Recreate the relevant data landscape at a point in time • Auditable: All relevant events and assets must be traceable with appropriate historical lineage • Consistent: Compliance practices must be consistent [Diagram labels: Governance Framework; ETL/DQ, BPM, Business Analytics, Visualization & Dashboards, ERP, CRM, SCM, MDM, Archive]
  • 5. Data Governance Challenges WITHIN Hadoop • No comprehensive governance within the Hadoop stack • Mostly disjoint as each project defines its own future and there is no common framework • Disparate tools, such as HCatalog, Ranger and Falcon, provide pieces of the overall solution • No integration with external governance frameworks • Difficult to get right because each project is autonomous and you need insight into traditional IT [Diagram labels: Apache Pig, Apache Hive, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm]
  • 6. Data Governance Initiative for Hadoop. TWO Requirements: 1. Hadoop must snap in to the existing frameworks and be a good citizen 2. Hadoop must also provide governance within its own stack of technologies. A group of companies dedicated to meeting these requirements in the open [Diagram labels: Data Governance Initiative; Common Governance Framework; ETL/DQ, BPM, Business Analytics, Visualization & Dashboards, ERP, CRM, SCM, MDM, Archive; Apache Pig, Apache Hive, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm]
  • 7. Common Data Governance Use Cases. Financial: Reporting Chain of custody, Lineage Narratives. Telco: Device log management, Correlation, Analysis, and Mitigation. Retail: Point of sale analysis, Price optimization. Healthcare: 30 day measures reporting
  • 8. Apache Atlas Overview. We Do Hadoop
  • 9. New Project Proposal: Apache Atlas. Proposed open source project aimed at solving the Hadoop data governance challenge in the open. Key Capabilities • Data Classification • Metadata Exchange • Centralized Auditing • Search & Lineage (Browse) • Security & Policy Engine. Essential Timeline. 1H 2015 GA: • REST API • Centralized Taxonomy • Import / export metadata • Basic Policy Rules Engine • Real-time access control • Column Level Tagging. Phase 2: • Advanced audit reporting • Advanced Policy Engine • Row / Column Masking • 3rd party Metadata exchange. Phase 3: • Collaboration Features • Self Service • Steward Delegation • Profiling & Pattern Analysis • Visualization [Diagram labels: Apache Atlas; Knowledge Store (Models, Type System, Policy Rules, Taxonomies); Audit Store; Tag Based Policies; Data Lifecycle Management; Real Time Tag Based Access Control; REST API; Services: Search, Lineage, Exchange; Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Energy (PPDM), Retail (PCI, PII), Other (CWM)]
  • 10. Apache Atlas Capabilities: Overview. Data Classification • Import or define business-oriented taxonomy annotations for data • Define, annotate, and automate capture of relationships between data sets and underlying elements including source, target, and derivation processes • Export metadata to third-party systems. Centralized Auditing • Capture security access information for every application, process, and interaction with data • Capture the operational information for execution, steps, and activities. Search & Lineage (Browse) • Pre-defined navigation paths to explore the data classification and audit information • Text-based search features locate relevant data and audit events across the Data Lake quickly and accurately • Browse visualization of data set lineage, allowing users to drill down into operational, security, and provenance related information. Security & Policy Engine • Rationalize compliance policy at runtime based on data classification schemes • Advanced definition of policies for preventing data derivation based on classification (i.e. re-identification) [Diagram labels: Apache Atlas; Knowledge Store (Models, Type System, Policy Rules, Taxonomies); Audit Store; Tag Based Policies; Data Lifecycle Management; Real Time Tag Based Access Control; REST API; Services: Search, Lineage, Exchange; Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Energy (PPDM), Retail (PCI, PII), Other (CWM)]
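
To make the "capture of relationships between data sets ... including source, target, and derivation processes" bullet on slide 10 more concrete, here is a minimal Python sketch of a lineage store. It is only an illustration and does not use the Apache Atlas API; the `LineageGraph` class, its methods, and the data set names are all hypothetical.

```python
from collections import defaultdict


class LineageGraph:
    """Hypothetical in-memory lineage store (not an Atlas type)."""

    def __init__(self):
        # Maps each target data set to the (source, process) pairs that produced it.
        self._inputs = defaultdict(set)

    def record(self, source, target, process):
        """Record that `process` derived `target` from `source`."""
        self._inputs[target].add((source, process))

    def upstream(self, dataset):
        """Return every data set that `dataset` depends on, directly or indirectly."""
        seen = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            for source, _process in self._inputs.get(current, ()):
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen


graph = LineageGraph()
graph.record("raw_sales", "clean_sales", "dedupe")
graph.record("clean_sales", "sales_report", "aggregate")
print(sorted(graph.upstream("sales_report")))
```

Walking the graph from a report back to its raw inputs is the kind of question a lineage browser answers; a production system would also persist the edges and attach audit metadata to each one.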
  • 11. Apache Atlas Overview: Knowledge Store. Knowledge store categorized with appropriate business-oriented taxonomy • Data sets & objects • Tables / Columns • Logical context • Source, destination. Supports exchange of metadata between foundation components and third-party applications/governance tools. Leverages existing Hadoop metastores. [Diagram labels: Apache Atlas; Knowledge Store (Models, Type System, Policy Rules, Taxonomies); Audit Store; Policy Engine; Data Lifecycle Management; Security; REST API; Services: Search, Lineage, Exchange; Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Custom (CWM), Retail (PCI, PII), Other]
  • 12. Apache Atlas Overview: Data Lifecycle Management. Leverage existing investment in Apache Falcon with a focus on: • Provenance • Multi-cluster replication • Data set retention/eviction • Late data handling • Automation [Diagram labels: Apache Atlas; Knowledge Store (Models, Type System, Policy Rules, Taxonomies); Audit Store; Policy Engine; Data Lifecycle Management; Security; REST API; Services: Search, Lineage, Exchange; Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Custom (CWM), Retail (PCI, PII), Other]
  • 13. Apache Atlas Overview: Audit Store. Historical repository for all governance events • Security: Access Grant & Deny • Operational: Data Provenance & Metrics • Indexed and Searchable [Diagram labels: Apache Atlas; Knowledge Store (Models, Type System, Policy Rules, Taxonomies); Audit Store; Policy Engine; Data Lifecycle Management; Security; REST API; Services: Search, Lineage, Exchange; Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Custom (CWM), Retail (PCI, PII), Other]
  • 14. Apache Atlas Overview: Security. Integration with HDP Advanced Security investments to ensure compliance. Establish global security policies based on data classification. Leverages Ranger plug-in architecture for policy enforcement. [Diagram labels: Apache Atlas; Knowledge Store (Models, Type System, Policy Rules, Taxonomies); Audit Store; Policy Engine; Data Lifecycle Management; Security; REST API; Services: Search, Lineage, Exchange; Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Custom (CWM), Retail (PCI, PII), Other]
  • 15. Apache Atlas Overview: Policy Engine. Runtime rationalization of policy rules with respect to data asset combinations and time. Fully extensible. • Metadata based • Geo based rules • Time-based rules • Hive Column Prohibitions • Preview: Hive Row and Column Masking [Diagram labels: Apache Atlas; Knowledge Store (Models, Type System, Policy Rules, Taxonomies); Audit Store; Policy Engine; Data Lifecycle Management; Security; REST API; Services: Search, Lineage, Exchange; Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Custom (CWM), Retail (PCI, PII), Other]
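
As a rough illustration of how metadata-based, geo-based, and time-based rules from slide 15 might be evaluated together at access time, consider the following Python sketch. It does not use the Atlas or Ranger APIs. The `Rule` class, the `is_allowed` function, the tag names, and the deny-overrides strategy are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule keyed on a classification tag (not an Atlas type)."""
    tag: str                                            # metadata-based condition
    denied_regions: frozenset = frozenset()             # geo-based condition
    allowed_hours: Optional[Tuple[time, time]] = None   # time-based condition


def is_allowed(column_tags, region, now, rules):
    """Deny-overrides evaluation: any applicable rule that trips blocks access."""
    for rule in rules:
        if rule.tag not in column_tags:
            continue  # rule does not apply to this column
        if region in rule.denied_regions:
            return False
        if rule.allowed_hours is not None:
            start, end = rule.allowed_hours
            if not (start <= now <= end):
                return False
    return True


# Assumed example policy: PII is blocked for EU requests,
# FINANCIAL data is readable only during business hours.
RULES = [
    Rule(tag="PII", denied_regions=frozenset({"EU"})),
    Rule(tag="FINANCIAL", allowed_hours=(time(9, 0), time(17, 0))),
]

print(is_allowed({"PII"}, "US", time(12, 0), RULES))
```

Because the rules key on tags rather than on individual tables or columns, a newly tagged column is covered without writing a new rule, which is the appeal of classification-driven policy.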
  • 16. Apache Atlas Overview: RESTful interface • Extensible enterprise classification of data assets, relationships and policies organized in a meaningful way, aligned to business organization. • Supports exploration via user interface • Supports extensibility via API and CLI exposure [Diagram labels: Apache Atlas; Knowledge Store (Models, Type System, Policy Rules, Taxonomies); Audit Store; Policy Engine; Data Lifecycle Management; Security; REST API; Services: Search, Lineage, Exchange; Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Custom (CWM), Retail (PCI, PII), Other]
  • 17. Coming 2H 2015
  • 18. Apache Atlas Overview: Enhanced Audit Store. Historical repository for all governance events • Immutable file format • Events Metadata Taggable • Advanced Reporting • Security: Access Grant & Deny • Operational: Data Provenance & Metrics • Indexed and Searchable [Diagram labels: Apache Atlas; Knowledge Store (Models, Type System, Policy Rules, Taxonomies); Audit Store; Policy Engine; Data Lifecycle Management; Security; REST API; Services: Search, Lineage, Exchange; Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Custom (CWM), Retail (PCI, PII), Other]
  • 19. Summary
  • 20. Apache Atlas Capabilities: Overview. Data Classification • Import or define business-oriented taxonomy annotations for data • Define, annotate, and automate capture of relationships between data sets and underlying elements including source, target, and derivation processes • Export metadata to third-party systems. Centralized Auditing • Capture security access information for every application, process, and interaction with data • Capture the operational information for execution, steps, and activities. Search & Lineage (Browse) • Pre-defined navigation paths to explore the data classification and audit information • Text-based search features locate relevant data and audit events across the Data Lake quickly and accurately • Browse visualization of data set lineage, allowing users to drill down into operational, security, and provenance related information. Security & Policy Engine • Rationalize compliance policy at runtime based on data classification schemes • Advanced definition of policies for preventing data derivation based on classification (i.e. re-identification) [Diagram labels: Apache Atlas; Knowledge Store (Models, Type System, Policy Rules, Taxonomies); Audit Store; Tag Based Policies; Data Lifecycle Management; Real Time Tag Based Access Control; REST API; Services: Search, Lineage, Exchange; Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Energy (PPDM), Retail (PCI, PII), Other (CWM)]
  • 21. Governance Ready Certification Program. Curated group of vendor partners to provide rich & complete features. Customers choose features that they want to deploy, a la carte. Low switching costs! HDP at core to provide stability and interoperability. [Diagram labels: Discovery, Tagging, Prep / Cleanse, ETL, Governance, BPM, Self Service, Visualization]
  • 22. Waterline Data improves speed to value and compliance. Deliver a Business-Ready Data Lake; Accelerate Data Prep Process; Govern Data in Hadoop [Diagram labels: Data Warehouse Offload, Data Science/Analytics Sandbox, Data Lake; Value Creation, Cost Savings]
  • 23. Find, understand and govern data in Hadoop
  • 24. The Modern Data Architecture
  • 25. Apache Atlas Capabilities: Overview. REST API; Business Glossary; Automated Classification (Tagging); Automated Lineage Discovery; Profiling and Data Quality; Schema Discovery; Change Detection and Audit • Glossary • Tags • Lineage • Models [Diagram labels: Apache Atlas; Knowledge Store (Models, Type System, Policy Rules, Taxonomies); Audit Store; Tag Based Policies; Data Lifecycle Management; Real Time Tag Based Access Control; REST API; Services: Search, Lineage, Exchange; Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Energy (PPDM), Retail (PCI, PII), Other (CWM)]
  • 26. Governance Ready Certification Program: Discovery, Tagging, Prep / Cleanse, ETL, Governance, BPM, Self Service, Visualization
  • 27. Imagine shopping on Amazon.com. GOVERNANCE: Inventory, Find and Understand, Provision
  • 28. Waterline Data is like Amazon.com for data in Hadoop. GOVERNANCE: Inventory, Find and Understand, Provision
  • 29. Inventory
  • 30. Find and Understand
  • 31. Provision
  • 32. Governance
  • 33. Find, understand and govern data in Hadoop. Big Data IT Architect: Deliver a Business-Ready Data Lake. Data Engineer/Data Scientist: Accelerate Data Prep Process. CDO/Data Steward: Govern Data in Hadoop
  • 34. Deliver a business-ready data lake. “It’s easy to get data into Hadoop, but it’s not necessarily easy to get data out of Hadoop. There is a need for data as a service to help the business find, understand, and govern data in Hadoop.” Joe DosSantos, EMC Big Data Practice Leader
  • 35. Deliver a business-ready data lake. “It’s easy to get data into Hadoop, but it’s not necessarily easy to get data out of Hadoop. There is a need for data as a service to help the business find, understand, and govern data in Hadoop.” Joe DosSantos, EMC Big Data Practice Leader
  • 36. Accelerate data prep process. “80% of Big Data analytics is data prep, and 80% of data prep is inventorying data.” Data Engineering Director, Financial Services
  • 37. Accelerate data prep process. “Waterline Data fills a critical gap in big data exploratory analytics by automating the tagging and cataloging of data, which in turn can help analytic teams provision the right data for their analyses.” Tony Baer, Principal Analyst, Ovum
  • 38. Govern data in Hadoop. “Data lakes therefore carry substantial risks. The most important is the inability to determine data quality or the lineage of findings by other analysts or users that have found value, previously, in using the same data in the lake. By its definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp. And without metadata, every subsequent use of data means analysts start from scratch.” “Gartner Says Beware of the Data Lake Fallacy” post on the Gartner website
  • 39. Govern data in Hadoop. “The first step to governing Big Data is to build an inventory.” Sunil Soares, Managing Partner, Information Asset
  • 40. Best practice approach to implement an enterprise grade data lake: 1. Create and populate landing area 2. Build inventory of data 3. Integrate with enterprise metadata repository 4. Protect sensitive data 5. Open up to users 6. Monitor and maintain
  • 41. Best practices in deployment landscape. 1. Create and populate landing area • Create Landing directory structure • Set up ETL processes using Falcon to orchestrate • Implement ETL jobs using ETL tools (Syncsort, Talend, Informatica, etc.), Hadoop tools (Sqoop, Flume, etc.) or FTP [Diagram labels: Falcon]
  • 42. Best practices in deployment landscape. 2. Build inventory of data • Crawl the cluster • Profile files • Automatically discover technical, business, and compliance metadata at a field level • Create Hive tables as needed • Import lineage • Export to Atlas [Diagram labels: Falcon, HCatalog, Atlas]
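
The "profile files" and "discover ... metadata at a field level" bullets on slide 42 can be illustrated with a small Python sketch. This is not how Waterline Data Inventory or Atlas work internally; the `PATTERNS` table, the `profile_fields` function, the 80% match threshold, and the sample data are all hypothetical, and a real profiler would use far richer heuristics.

```python
import csv
import io
import re

# Hypothetical tag patterns, assumed for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def profile_fields(csv_text, threshold=0.8):
    """Return {field: [tags]} where at least `threshold` of non-empty values match a tag."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    tags = {}
    if not rows:
        return tags
    for field in rows[0].keys():
        values = [row[field] for row in rows if row[field]]
        found = []
        for tag, pattern in PATTERNS.items():
            if values:
                ratio = sum(bool(pattern.match(v)) for v in values) / len(values)
                if ratio >= threshold:
                    found.append(tag)
        tags[field] = sorted(found)
    return tags


SAMPLE = (
    "name,contact,ssn\n"
    "Ann,ann@example.com,123-45-6789\n"
    "Bob,bob@example.org,987-65-4321\n"
)
print(profile_fields(SAMPLE))
```

Tagging by the share of matching values, rather than by column name, is what lets a profiler flag a sensitive field even when it is labeled something uninformative such as `contact`.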
  • 43. Best practices in deployment landscape. 3. Integrate with enterprise metadata repository • Import business glossary terms and export new tags and updated definitions • Synchronize Atlas and Waterline Data Inventory • Export metadata and lineage from Hadoop to Enterprise repository [Diagram labels: Falcon, HCatalog, Atlas]
  • 44. Best practices in deployment landscape. 4. Protect sensitive data • Use Waterline Data Inventory to find sensitive data • Create access privileges in Ranger • Encrypt or de-identify [Diagram labels: HCatalog, Ranger, Falcon, Atlas]
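
For the "encrypt or de-identify" bullet on slide 44, one possible de-identification technique is keyed hashing (pseudonymization). The slides do not say which technique is intended, so the choice of HMAC-SHA256 here is an assumption, as are the `pseudonymize` and `deidentify_record` helper names and the sample record. The sketch uses only the Python standard library.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace a value with a keyed SHA-256 digest (HMAC), hex encoded."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify_record(record, sensitive_fields, secret_key):
    """Return a copy of `record` with the sensitive string fields pseudonymized."""
    return {
        field: pseudonymize(value, secret_key) if field in sensitive_fields else value
        for field, value in record.items()
    }


# Assumed example data; in practice the key would come from a key management service.
KEY = b"example-key-do-not-use-in-production"
record = {"name": "Ann", "ssn": "123-45-6789", "state": "CA"}
masked = deidentify_record(record, {"name", "ssn"}, KEY)
print(masked["state"], len(masked["ssn"]))
```

The same input and key always yield the same digest, so pseudonymized columns can still be joined across data sets, while the original values cannot be read back without an exhaustive search using the key.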
  • 45. Best practices in deployment landscape. 5. Open up to users • Create account with Kerberos, LDAP, etc. • Set up ACLs (leverage Ranger) • Users can browse securely through Waterline Data Inventory [Diagram labels: HCatalog, Ranger, Falcon, Atlas]
  • 46. Best practices in deployment landscape. 6. Monitor and maintain • Continue profiling new or changed files and sync with Atlas • Continue monitoring for sensitive data, use Ranger to protect • Build a folksonomy and synchronize with business glossary in Atlas and Enterprise Business Glossary [Diagram labels: HCatalog, Ranger, Falcon, Atlas]
  • 47. Find, understand and govern data in Hadoop. CDO/Data Steward: Discover lineage and business metadata automatically, and manage metadata. Big Data Architect: Automate cataloging of data assets at scale, with secure provisioning to business users. Data Engineer/Data Scientist/Business Analyst: Find and understand best-suited and most trusted data without having to explore every file manually
  • 48. Questions and Answers
  • 49. Next Steps… Download the Hortonworks Sandbox: Learn Hadoop; Build Your Analytic App; Try Hadoop 2. More about Waterline Data & Hortonworks: http://hortonworks.com/partner/waterline-data Joint tutorial: bit.ly/DataLakeTutorial Modern Data Architecture Paper: go.waterlinedata.com/hw-mda
  • 50. The Largest Hadoop Community Events in Europe and North America. SAN JOSE June 9-11; BRUSSELS April 15-16 • Deep-dive technical content • 65+ sessions and 5 tracks • 1,000 attendees • Sponsorships Available • Including Pre and Post event community meetups and BOFs • Hadoop training available • 100+ sessions and 7 tracks • Deep-dive technical content • 5,000 attendees • Sponsorships Available • Including Pre and Post event community meetups and BOFs • Hadoop training available. www.hadoopsummit.org