Do you know where your data is?
Do you know who is responsible for a specific dataset?
Do you know which application or task modified this entity last Friday?
Apache Atlas helps you manage the metadata of your data. With Apache Atlas you can trace the lineage between your datasets and the processes that use them.
Manage traceability with Apache Atlas, a flexible metadata repository
1. Copyright Synaltic 2015
Manage traceability with Apache Atlas, a flexible metadata repository
Charly Clairmont
Synaltic
@egwada
cclairmont@synaltic.fr
http://synaltic.fr
2.
More than ten years of experience in IT, mainly in BI
Cofounder of Altic, now Synaltic
Cofounder of the Hadoop User Group France
Believes in Open Source to help enterprises create value
Helps open source projects become known via meetups and conferences
Charly Clairmont
3.
An integrator company mainly focused on Data Management
Founded in 2004, Synaltic is the merger of two companies, Synotis and Altic
25 specialists in Data Management
A Swiss subsidiary based in Lausanne
Our values
● Commitment
● Expertise
● Loyalty
Synaltic
R&D, Training, Support, Project, Expertise
Data Intelligence, Data Platform, Data Governance, Data Exchange
4.
What about your data?
Do you know where your data is?
Do you know who is responsible for a specific dataset?
Do you know which application or task modified this entity last Friday?
5.
Enterprise Data Governance
Provide a common approach to data governance across all systems and data within the organization
– Transparent
– Reproducible
– Auditable
– Consistent
6.
Enterprise Data Governance, in Hadoop
No specific way to address this requirement
– Each project proposes its own way to resolve data governance
– No integration with existing enterprise frameworks for data governance
8.
Apache Atlas, Overview
Data Classification
● Business-oriented taxonomy annotations
● Relationships between data sets and underlying elements, including source, target, and derivation processes
● Export metadata to third-party systems
Centralized Auditing
● Security access information for every application and process
● Operational information for execution, steps, and activities
Search & Lineage (Browse)
● Navigation paths to explore the data classification and audit information
● Text-based search to locate what is relevant
● Visualization of data set lineage
Security & Policy Engine
● Compliance policy at runtime based on data classification schemes
● Advanced definition of policies for preventing data derivation
9.
Apache Atlas, Knowledge Store
Knowledge store categorized with an appropriate business-oriented taxonomy
● Data sets & objects
● Tables / Columns
● Logical context
● Source, destination
Supports exchange of metadata between foundation components and third-party applications/governance tools
Tech:
● Titan with Apache HBase
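To make the knowledge store idea concrete, here is a minimal in-memory sketch in Python. It is not the Atlas API: the `Entity` class, the `find_by_tag` helper, the tag names, and the sample tables are all hypothetical, chosen only to show how tables and columns can carry business taxonomy tags and be looked up by them.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A metadata entity (hypothetical model, not an Atlas type)."""
    name: str
    kind: str                                      # e.g. "table" or "column"
    tags: set = field(default_factory=set)         # business taxonomy terms
    children: list = field(default_factory=list)   # e.g. columns of a table


def find_by_tag(entities, tag):
    """Return the names of all entities (children included) carrying `tag`."""
    found = []
    for entity in entities:
        if tag in entity.tags:
            found.append(entity.name)
        found.extend(find_by_tag(entity.children, tag))
    return found


# Sample catalog: illustrative data only.
users = Entity("users", "table", {"social"}, [
    Entity("users.screen_name", "column", {"personal-data"}),
    Entity("users.followers", "column"),
])
tweets = Entity("tweets", "table", {"social", "personal-data"})

catalog = [users, tweets]
print(find_by_tag(catalog, "personal-data"))
```

In a real deployment the entities would live in the graph store rather than in Python objects; the sketch only shows the shape of the lookup.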
10.
Apache Atlas, Data Lifecycle Management
Provenance
Multi-cluster replication
Data set retention/eviction
Late data handling
Automation
Tech:
● Apache Falcon
11.
Apache Atlas, Audit Store
Historical repository for all governance events
● Security: Access Grant & Deny
● Operational: Data Provenance & Metrics
● Indexed and Searchable
Tech:
● YARN ATS, Apache HBase, Apache Hive, Solr, ElasticSearch (Pluggable)
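The audit store described above can be pictured as an append-only event log with an index for searching. The following Python sketch is an illustration under assumptions, not Atlas code: `AuditEvent`, `AuditStore`, their fields, and the sample events are hypothetical, and a real backend would be one of the pluggable stores listed on the slide.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    """One governance event (hypothetical fields, not an Atlas schema)."""
    timestamp: str   # ISO 8601 string
    category: str    # "security" or "operational"
    user: str
    asset: str
    action: str      # e.g. "grant", "deny", "read", "write"


class AuditStore:
    """Append-only event log with a simple index on (field, value)."""

    INDEXED = ("category", "user", "asset", "action")

    def __init__(self):
        self._events = []
        self._index = defaultdict(list)   # (field, value) -> event positions

    def record(self, event):
        position = len(self._events)
        self._events.append(event)
        for name in self.INDEXED:
            self._index[(name, getattr(event, name))].append(position)

    def search(self, **criteria):
        """Return events matching all criteria, in insertion order."""
        if not criteria:
            return list(self._events)
        position_sets = [set(self._index.get(item, [])) for item in criteria.items()]
        matches = set.intersection(*position_sets)
        return [self._events[i] for i in sorted(matches)]


store = AuditStore()
store.record(AuditEvent("2015-06-05T10:00:00", "security", "alice", "hive.tweets", "deny"))
store.record(AuditEvent("2015-06-05T11:30:00", "operational", "etl", "hive.tweets", "write"))
store.record(AuditEvent("2015-06-06T09:00:00", "security", "bob", "hive.users", "grant"))

denied = store.search(category="security", asset="hive.tweets")
print([event.user for event in denied])
```

Keeping the log append-only is what makes it usable as a historical record; the index is only there to answer questions such as "who touched this asset" without scanning every event.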
13.
Apache Atlas, Policy Engine
Runtime rationalization of policy rules with respect to data asset combinations and time. Fully extensible.
● Metadata-based rules
● Geo-based rules
● Time-based rules
● Column/Attribute prohibitions
● Preview: Hive row and column masking
Tech:
● Ranger
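The rule families listed above (metadata-based, geo-based, time-based) can be sketched as small predicates that each may deny a request. This is a hedged illustration in Python, not how Ranger or Atlas define policies: `AccessRequest`, the rule factories, the tag `personal-data`, and the group and country values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class AccessRequest:
    user_groups: set
    column_tags: set    # classification tags attached to the requested column
    country: str        # where the request comes from
    at: time            # local time of the request


def deny_tag_unless_group(tag, group):
    """Metadata-based rule: columns tagged `tag` require membership in `group`."""
    def rule(req):
        return tag in req.column_tags and group not in req.user_groups
    return rule


def deny_outside_countries(tag, allowed):
    """Geo-based rule: tagged data may only be read from `allowed` countries."""
    def rule(req):
        return tag in req.column_tags and req.country not in allowed
    return rule


def deny_outside_hours(start, end):
    """Time-based rule: access is only allowed between `start` and `end`."""
    def rule(req):
        return not (start <= req.at <= end)
    return rule


def is_allowed(request, rules):
    """A request is allowed only if no rule denies it."""
    return not any(rule(request) for rule in rules)


rules = [
    deny_tag_unless_group("personal-data", "compliance"),
    deny_outside_countries("personal-data", {"FR", "CH"}),
    deny_outside_hours(time(8, 0), time(20, 0)),
]

request = AccessRequest({"analysts"}, {"personal-data"}, "FR", time(14, 30))
print(is_allowed(request, rules))   # False: analysts are not in "compliance"
```

Because the rules key on classification tags rather than on table names, tagging a new column is enough to bring it under the existing policies.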
14.
Apache Atlas, RESTful interface
Extensible enterprise classification of data assets, relationships and policies, organized in a meaningful way and aligned to the business organization.
Supports exploration via user interface
Supports extensibility via API and CLI exposure
15.
A use case
Our process
[Diagram: Twitter processing pipeline on the data platform. Data source: Twitter. Processes: Import (collect from Twitter), Reference data, Analysis (build social network). Datasets: HDFS raw data; Hive tables url, hash tags, users, tweets, social network.]
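The Twitter pipeline in this use case is a natural example of dataset lineage. The Python sketch below models it as a small graph and walks upstream from one dataset. The exact wiring between processes and tables is an assumption reconstructed from the slide labels, and the names `PROCESSES`, `build_upstream`, and `lineage` are hypothetical; nothing here calls Atlas.

```python
from collections import defaultdict

# process -> (input datasets, output datasets).
# ASSUMPTION: this wiring is guessed from the slide labels, not confirmed.
PROCESSES = {
    "import (collect from twitter)": (["twitter"], ["hdfs:raw_data"]),
    "reference data": (
        ["hdfs:raw_data"],
        ["hive:url", "hive:hash_tags", "hive:users", "hive:tweets"],
    ),
    "analyse (build social network)": (
        ["hive:users", "hive:tweets"],
        ["hive:social_network"],
    ),
}


def build_upstream(processes):
    """Map each dataset to the (process, input) pairs that produce it."""
    upstream = defaultdict(list)
    for process, (inputs, outputs) in processes.items():
        for out in outputs:
            for inp in inputs:
                upstream[out].append((process, inp))
    return upstream


def lineage(dataset, upstream):
    """Return every dataset `dataset` derives from, nearest first (BFS)."""
    seen, queue, order = {dataset}, [dataset], []
    while queue:
        current = queue.pop(0)
        for _process, source in upstream.get(current, []):
            if source not in seen:
                seen.add(source)
                order.append(source)
                queue.append(source)
    return order


upstream = build_upstream(PROCESSES)
print(lineage("hive:social_network", upstream))
```

Walking the same graph in the other direction (from a source to everything derived from it) would answer the impact question: which tables are affected if the raw data changes.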