Fifth Elephant Apache Atlas Talk
1. Governance using Apache Atlas: Why and How
Vimal Sharma, Apache Atlas PMC & Committer
Software Engineer, Hortonworks
Apache ID: svimal2106@apache.org
2. Apache Atlas : Project Details
Ø Entered the Apache Incubator in May 2015
Ø Organizations: IBM, Hortonworks, Aetna, Merck, Target
Ø 3 releases in the last year
Ø Graduated to a Top Level Project in June 2017
Release timeline: 0.7 (July 2016) → 0.7.1 (Jan 2017) → 0.8 (Mar 2017) → TLP (June 2017)
3. Apache Atlas : Introduction
Ø Governance and metadata framework for Hadoop
Ø Model a component and capture metadata
Ø Data Assets - Hive Table, HBase column family
Ø Process - Storm Topology, Sqoop Import
Ø Classification - Tag metadata entities
Ø Built-in support for popular components
Ø Extensible Architecture
5. Governance Problem (Use Cases)
Ø ETL Pipeline Failure Scenarios
• Upstream failure analysis
• Alerts to downstream processes
• Visual lineage of ETL pipelines
Ø Redundant Processing
• Does a derived dataset already contain the required information?
• Can metadata classification be used to determine this?
• Avoid expensive processing
6. Use Cases
Ø Compliance and Security
• Impose security constraints on sensitive data
• Data can span multiple Hadoop components
• One policy to govern them all
Ø Cluster Admin
• Periodic cleanup of datasets
• Which datasets are unused or dormant?
• How to define the relevance of a dataset?
7. Cross Component Lineage
• Lineage: upstream and downstream data assets
• Individual components maintain their own metadata stores
• Cross-component events
• Atlas: flexibility to model arbitrary components
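In the Atlas model, cross-component lineage is captured by a Process entity whose inputs and outputs reference the upstream and downstream data assets. The sketch below builds such a payload in the v2 REST JSON shape; the process name, cluster suffix, and GUIDs are illustrative, not taken from a live cluster.

```python
# Sketch of an Atlas v2 entity payload for a cross-component process,
# e.g. a Sqoop import reading an RDBMS table and writing a Hive table.
# All names and GUIDs here are illustrative.

def process_entity(name, input_guids, output_guids):
    """Build a Process-style entity payload referencing assets by GUID."""
    return {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "qualifiedName": name + "@cluster1",  # assumed cluster name
                "name": name,
                "inputs": [{"guid": g} for g in input_guids],
                "outputs": [{"guid": g} for g in output_guids],
            },
        }
    }

payload = process_entity("sqoop_import_orders", ["guid-rdbms-1"], ["guid-hive-1"])
# Would be submitted with something like:
#   requests.post("http://atlas-host:21000/api/atlas/v2/entity", json=payload)
```

Because the inputs and outputs are plain GUID references, the same Process shape works whether the endpoints live in Hive, HBase, or an external system.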
8. Ranger Integration
• Ranger: listens for tag addition/deletion events
• Attribute-based policies rather than asset-based policies
• Example tag: PII
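The tag-based flow amounts to attaching a classification (such as PII) to an entity in Atlas, which Ranger then picks up to enforce an attribute-based policy. A minimal sketch of the request, assuming Atlas's v2 classifications endpoint; the host and entity GUID are hypothetical.

```python
# Sketch: attach a "PII" classification to an entity so a Ranger
# tag-based policy can apply to it. Host and GUID are illustrative.

ATLAS_BASE = "http://atlas-host:21000"   # assumed Atlas server
ENTITY_GUID = "2f1d-example-guid"        # hypothetical entity GUID

def classification_request(guid, tag):
    """Build the URL and body for adding one classification to an entity."""
    url = f"{ATLAS_BASE}/api/atlas/v2/entity/guid/{guid}/classifications"
    body = [{"typeName": tag}]           # a list of classifications to add
    return url, body

url, body = classification_request(ENTITY_GUID, "PII")
# requests.post(url, json=body) would apply the tag; Ranger's tag sync
# then sees the PII classification and enforces the matching policy.
```

The point of the design is that the policy is written once against the PII tag, not repeated per table, column family, or topic.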
9. Type System
• Model of metadata to be stored
• Every type has
Ø Unique Name
Ø Attributes
Ø SuperTypes
• Attributes
Ø Mandatory/Optional
Ø Unique
Ø Composite
Ø ReverseReference
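The bullets above map directly onto Atlas's entity type definitions. Below is a sketch of a custom type in the v2 typedef JSON shape, exercising a unique name, a supertype, and mandatory/optional and unique attribute flags; the type `my_dataset` and its attributes are invented for illustration.

```python
# Sketch of an Atlas v2 entity type definition. The custom type
# "my_dataset" and its attributes are invented for illustration.

typedefs = {
    "entityDefs": [
        {
            "name": "my_dataset",            # unique type name
            "superTypes": ["DataSet"],       # inherit from built-in DataSet
            "attributeDefs": [
                {
                    "name": "qualifiedName",
                    "typeName": "string",
                    "isOptional": False,     # mandatory attribute
                    "isUnique": True,        # unique across entities
                    "cardinality": "SINGLE",
                },
                {
                    "name": "retentionDays",
                    "typeName": "int",
                    "isOptional": True,      # optional attribute
                    "isUnique": False,
                    "cardinality": "SINGLE",
                },
            ],
        }
    ]
}
# Registered via something like:
#   requests.post(".../api/atlas/v2/types/typedefs", json=typedefs)
```

Composite and reverse-reference attributes are declared with additional constraint fields on the attribute definition, following the same structure.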
15. Hook Design
Ø Hive Hook
• Multiple clients, e.g. Pig, Hive CLI, Beeline
• Always sends a full update to avoid inconsistency
Ø Synchronous vs Asynchronous communication
• Earlier: the hook communicated with the server directly
• Now: metadata entities are pushed to Kafka
Ø Un-partitioned Kafka topic
• Avoids out-of-order messages
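Because the hook topic is un-partitioned (effectively a single partition), Kafka delivers the hook's messages to the server in publish order. A sketch of the hook side of that flow; the message shape below is illustrative, not the exact Atlas HookNotification JSON.

```python
import json

# Sketch of the hook side: serialize entity notifications and send them
# all to one topic so a consumer sees them in publish order. The JSON
# shape here is illustrative, not Atlas's exact HookNotification format.

ATLAS_HOOK_TOPIC = "ATLAS_HOOK"   # single-partition topic consumed by the server

def hook_message(op, entity):
    """Serialize one notification carrying a full entity snapshot."""
    return json.dumps({"type": op, "entities": [entity]})

msgs = [
    hook_message("ENTITY_CREATE",
                 {"typeName": "hive_table", "attributes": {"name": "t1"}}),
    hook_message("ENTITY_FULL_UPDATE",
                 {"typeName": "hive_table", "attributes": {"name": "t1"}}),
]
# With a Kafka client this would be roughly:
#   for m in msgs:
#       producer.send(ATLAS_HOOK_TOPIC, m.encode())  # one partition => FIFO
order = [json.loads(m)["type"] for m in msgs]
```

With a single partition, the create can never be processed after the full update, which is exactly the inconsistency the design avoids.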
16. Roadmap
Ø Hooks for Spark, HBase and NiFi
Ø Column level lineage for Hive
• create table dest as select (a + b) x, (c * d) y from source
Ø Export/Import of metadata
(Diagram: columns a and b feed an Addition expression producing x)
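For the CTAS statement above, column-level lineage would record which source columns feed each destination column through which expression. A sketch of the expected mapping; the dict shape is invented for illustration, not Atlas's internal representation.

```python
# Sketch: the column lineage implied by
#   create table dest as select (a + b) x, (c * d) y from source
# Each destination column maps to the source columns in its expression.
# The dict shape is illustrative only.

column_lineage = {
    "dest.x": {"expression": "a + b", "inputs": ["source.a", "source.b"]},
    "dest.y": {"expression": "c * d", "inputs": ["source.c", "source.d"]},
}

# Flatten to answer "which source columns does dest depend on at all?"
all_inputs = sorted(c for v in column_lineage.values() for c in v["inputs"])
```

This is the granularity the roadmap item targets: today Hive lineage stops at the table level, so a change to `source.a` cannot be traced specifically to `dest.x`.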
17. Contribute
Ø Project Website - http://atlas.apache.org/
Ø Dev Mailing List - dev@atlas.apache.org
Ø User Mailing List - user@atlas.apache.org
Ø JIRA link - https://issues.apache.org/jira/browse/ATLAS