Apache Atlas and Apache Ranger, foundational components for security and governance across the Hadoop stack, have spawned a robust partner ecosystem of tools and platforms. These partner solutions build on the extensibility of both platforms, using open, robust APIs and integration patterns to deliver innovative "better-together" capabilities. In this talk, we will showcase how the partner ecosystem is building value-added capabilities on the Apache Ranger and Apache Atlas frameworks to help organizations address GDPR. The talk will feature multiple partner demonstrations covering how to identify, map, and classify personal data; harvest and maintain metadata; track and map the movement of data through your enterprise; and enforce appropriate controls to monitor access to and usage of personal data. We will also give a short overview of the Gov Ready and Sec Ready programs and how partners can benefit from their certification process.
Speakers
Ali Bajwa, Principal Solutions Engineer, Hortonworks
Srikanth Venkat, Senior Director Product Management, Hortonworks
33. Lineage in DMX-h – ingestion to the cluster
DMX-h job executes
• In-cluster sources/targets: HDFS, Hive, S3
• Out-of-cluster sources/targets: mainframe, DBMSs, local and remote file systems (Syncsort External Datasets)
DMX-h job collects lineage information
• Source/Target File or Table level
DMX-h job lineage is published into Apache Atlas
• Connect with lineage published from other tools (REST)
Syncsort Confidential and Proprietary - do not copy or distribute
34. Syncsort DMX-h Atlas Integration
35. Govern and Track Everything for Compliance
• Metadata and data lineage for Hive, Avro and
Parquet through HCatalog
• Metadata lineage export and API from DMX/DMX-h
– Simplify audits, analytics dashboards, metrics
– Integrate with enterprise metadata repositories
• Apache Ambari integration
– Native LDAP and Kerberos support
– Secure mainframe data access through FTPS and
Connect:Direct
• Apache Atlas ingestion lineage integration
– Audit and track data from source to cluster
– Lineage & tagging of Metadata for GDPR
Compliance
36. End-to-End Data Lineage in Apache Atlas (slides 36–41 build this diagram step by step)
Data Sources
Syncsort accesses data from sources outside cluster.
Syncsort onboards data, modifies on-the-fly to match Hadoop storage model.
Data Hub
Syncsort changes, enhances, joins data in cluster with MapReduce or Spark.
Syncsort passes source-to-cluster data lineage info to Atlas.
Analytics, Visualization
Analytics and visualizations get complete data.
Data analyst gets end-to-end data lineage info from Atlas.
42. Syncsort: High Performance Import from Existing Databases
• Connect to virtually any data source, including mainframe and MPP databases.
• Move data into and out of Hadoop up to 6x faster without the need for manual scripts.
• Develop ETL processes without writing code.
• Seamlessly accelerate Hadoop performance and scalability for ETL operations in both MapReduce and Spark.
Benefits
43. Syncsort + Hortonworks Advantages
• Apache Ambari Integration
• Deploy DMX-h across cluster
• Monitor DMX-h jobs
• Process in MapReduce or Spark
• Source relational and non-relational data (including mainframes)
• Out-of-the-box integration, interoperability & certifications
• Kerberos-secured clusters
• Apache Ranger security certified
• Early beta, release certification
• Metadata lineage export from DMX
• Supports easy identification and management of GDPR-relevant metadata
Technical Benefits
62. GDPR
Be transparent with all PII data
Why not turn GDPR into a new
customer experience?
Dataworks Summit Berlin 2018
Jan-Kees Buenen, CEO
(C) 2018 SynerScope
63. "6 Steps for GDPR" expanded to unstructured enterprise data
• Discover and classify data content in full context
• Know the entire data infrastructure
• Know the entire data flow patterns
• Establish and execute remediation policies
• Apply same governance to processing
• Monitor through certified audits
Know who and what application produces and uses PII data
Know the PII data that rests in your unstructured data
Know its exact location, expiry date, consent status
Set and execute your policies based on your granular knowledge of the content
Log every event touching your data; Atlas and Ranger are integrated in fully automated processes in SynerScope
Have the data instantly available at individual record level for external (certified) audit purposes (the Big 4 love sampling)
64. GDPR compliance for all content
Transparency for governance
• Data Discovery
• Data Search
• Data Matching
• Data Context
• Data Quality
• Data Use patterns
• Audit Ready (Big Four endorsed)
Numbers, Text, IoT, Video, Audio, Ecosystem
Include the "other" 80% of the enterprise data
65. SynerScope product position for GDPR and IFRS
FAST, FLEXIBLE AND TRANSPARENT
o Fast and flexible with raw data: no cleaning, no upfront modeling
o Fast and flexible for complex new combinations of data brought in from many different silos
o Transparency at individual cell record level, with data presented in full context, allows for certified audits
o The big audit firms will play an important role between the enterprise, regulators and supervisory bodies
o Ecosystem demands independent certification of data operations
Unstructured data is the Achilles heel of true GDPR compliance
... "SynerScope's Intelligence Augmentation (IA) can handle the most complex data situations fast and reliably" (Big 4 accounting firm)
What does the ecosystem look like?
Connectors exist for Sqoop, Hive, Storm, and Kafka, as well as a custom integration method to build your own connector via a highly scalable REST API. For example, although there is no first-class connector for Spark, you can hook a snippet of code at the end of your Spark job to report lineage/metadata info into Atlas. More native connectors are being worked on for future releases: NiFi and HBase.
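The end-of-job hook idea above can be sketched roughly as follows. This is a minimal illustration, not a shipped connector: the host/port, the `@cluster1` qualified-name convention, the job and path names, and the credentials are all assumptions. It uses Atlas's v2 REST endpoint `POST /api/atlas/v2/entity` and stitches into existing lineage by referencing `hdfs_path` entities via their unique `qualifiedName`.

```python
import json
import urllib.request

# Assumed Atlas host/port for illustration.
ATLAS_URL = "http://atlas-host:21000/api/atlas/v2/entity"

def build_lineage_entity(job_name, input_paths, output_paths):
    """Build an Atlas v2 entity payload describing one job run as a Process.

    Inputs/outputs reference pre-existing hdfs_path entities by their unique
    qualifiedName, so Atlas connects this process to lineage published by
    other tools over the same files.
    """
    def ref(path):
        return {"typeName": "hdfs_path",
                "uniqueAttributes": {"qualifiedName": path}}
    return {
        "entity": {
            "typeName": "Process",  # or a custom subtype registered beforehand
            "attributes": {
                "qualifiedName": job_name + "@cluster1",  # assumed convention
                "name": job_name,
                "inputs": [ref(p) for p in input_paths],
                "outputs": [ref(p) for p in output_paths],
            },
        }
    }

def publish(payload):
    """POST the payload to Atlas, e.g. as the last action of a Spark driver.
    (Authentication handling is omitted here for brevity.)"""
    req = urllib.request.Request(
        ATLAS_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)

# Build (but do not send) a sample payload for one ingest job.
payload = build_lineage_entity(
    "daily_ingest",
    ["hdfs://nn/raw/events@cluster1"],
    ["hdfs://nn/clean/events@cluster1"],
)
print(payload["entity"]["attributes"]["name"])
```

In practice the call to `publish` would sit in a shutdown hook or the final stage of the Spark application, after the output datasets are committed.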
We also have a partner program for "Gov Ready" certification, and you can see a list of partners who have already built integrations.
Some interesting ones:
Talend: data pipelining done in their canvas gets faithfully converted into Atlas lineage graph so we’re able to capture all the steps/transformations/metadata for each of the processes/entities in that chain
Dataguise and Waterline do data discovery and are able to publish classifications in bulk into Atlas; the same can be done for lineage.
IGC is special: it is joined at the hip with Atlas. They will have one-to-one model equivalency on the back end and will be able to query each other for metadata, lineage, etc.
The slide shows the high-level control flow (the title is the first line of each of the three boxes): a DMX-h job runs and produces lineage info, which is later published into Atlas.
More details for each box:
DMX-h job executes – currently the product looks at lineage at the source/target level. From the perspective of Atlas, we need to categorize sources/targets that are standardized in Atlas (e.g. Hive, HDFS) versus the ones that are not, so that DMX-h can later publish the lineage around these sources/targets as expected by Atlas.
DMX-h job produces lineage information – currently this is done for ingestion only, not for distributed executions, and not at field level.
DMX-h job lineage is published into Atlas – DMX-h publishes lineage using (existing) HDFS file and Hive table entities in Atlas, as they are standardized. Other tools (e.g. Hive SQL queries) can use the same HDFS/Hive entities to publish their own lineage, thereby "connecting" to the lineage from DMX-h.
We use the REST API to publish the DMX-h lineage. The product currently uses v1 of the API, which is now legacy; v2 is the most current, so we need to update our product to v2.
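For the v1-to-v2 migration mentioned in the note, the main visible change is the endpoint layout: v1 exposed entities under `/api/atlas/entities`, while v2 moves them under `/api/atlas/v2/`. A small lookup like the sketch below can ease a gradual migration; the paths are Atlas's public REST paths, but the host/port and the idea of a fallback table are assumptions for illustration.

```python
# Endpoint paths: Atlas v1 (legacy) vs v2 (current). Host/port are placeholders.
BASE = "http://atlas-host:21000"

V1_ENDPOINTS = {
    "create_entity": BASE + "/api/atlas/entities",  # legacy v1 API
}
V2_ENDPOINTS = {
    "create_entity": BASE + "/api/atlas/v2/entity",
    "bulk_create":   BASE + "/api/atlas/v2/entity/bulk",
}

def resolve_endpoint(name):
    """Prefer the v2 endpoint when one exists; fall back to v1 otherwise."""
    return V2_ENDPOINTS.get(name) or V1_ENDPOINTS.get(name)

print(resolve_endpoint("create_entity"))
print(resolve_endpoint("bulk_create"))
```

Besides the paths, v2 also changes the payload shape (plain attribute maps and `uniqueAttributes` references instead of the v1 typed structures), so the request bodies need migrating along with the URLs.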
This is a simple DMX-h job that ingests an EBCDIC file into the cluster and converts it to ASCII on the fly.
Syncsort DMX-h is highly-efficient software with a small footprint, yet it packages the comprehensive support you need to manage, secure and govern your modern data architecture:
Manage: Full integration with Apache Ambari
Secure:
Native LDAP and Kerberos support
Integration with Apache Ranger
Secure mainframe data access through FTPS and Connect:Direct
Govern:
Tight integration with HCatalog for metadata management and data lineage
Work directly with mainframe data in its native format – preserving data lineage
Can tag metadata that contains Personally Identifiable Information (PII), which is critical for GDPR compliance (i.e. knowing where personal data is stored)
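Tagging metadata as PII in Atlas comes down to associating a classification with an existing entity via `POST /api/atlas/v2/entity/guid/{guid}/classifications`. A minimal sketch of the request URL and body follows; the `PII` tag name (which must already be defined as a classification type in Atlas), the attribute names, and the GUID are assumptions for illustration.

```python
import json

def build_pii_classifications(attrs=None):
    """Body for POST /api/atlas/v2/entity/guid/{guid}/classifications:
    a JSON list of classification objects to attach to the entity."""
    return [{"typeName": "PII", "attributes": attrs or {}}]

def classification_url(base, guid):
    """URL for attaching classifications to the entity with this GUID."""
    return f"{base}/api/atlas/v2/entity/guid/{guid}/classifications"

# Example: tag a (hypothetical) Hive column entity as PII.
url = classification_url("http://atlas-host:21000", "hypothetical-guid-123")
body = json.dumps(build_pii_classifications({"source": "mainframe"}))
print(url)
print(body)
```

Once tagged, the same PII classification can drive tag-based access policies in Apache Ranger, which is the Atlas/Ranger combination this talk centers on.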
A better way is needed – so that, just like the chef, we can have a complete view of our data, from the origin to the data hub – and know what has happened to it at every step of the way
Syncsort/Hortonworks reference architecture
Deployed by Ambari
On every node
Data movement and transformation
MapReduce or Spark
First AI powered all-in-one big data solution
Solves the big data myth once and for all
Data ingest – Organize – Search – Analyze – Extract
All in One
Ultra-fast big data visual analytics
Unlock the big data complexity
Interactive and dynamic user interface
fusing Deep Learning with a scalable Data Lake into a ready-to-go Big Data solution
DSR – Data Subject Rights (such as consent) as they flow through the data assets