Watch our latest quarterly customer education webcast to learn about the latest advancements in Syncsort DMX and DMX-h data integration software, including our new product DMX Change Data Capture (CDC).
Many of our customers use DMX-h to quickly and efficiently populate their data lakes with enterprise-wide data, to power a variety of use cases, including data as a service, data archiving, fraud detection, and Customer 360. But, as important as it is to populate the data lake, it’s equally important to keep that data current for accurate decision making.
DMX Change Data Capture makes it easy and efficient to keep your data lake fresh after the initial load with real-time data replication that continually applies changes made on your traditional systems to your cluster.
Big Data Q2 Customer Education Webcast: New DMX Change Data Capture for Hadoop Keeps Your Data Lake Fresh!
1. Big Data Customer Education Webcast
Q2 2017
Paige Roberts
Product Manager Big Data
2. Agenda
Company Update
• Syncsort Trillium
• EDW Optimization with Hortonworks
Lots of Cool New Capabilities in DMX/DMX-h
• New sources
• Hive enhancements
• Spark 2.0 support
• Cloudera Director
• Metadata export
• Atlas ingestion
• Intelligent Execution with Integrated workflow
3 Especially Cool New Capabilities Coming Soon
• Big Data Quality – DMX and Trillium Integration
• DataFunnel New UI
• DMX Change Data Capture
What’s Next
Syncsort Confidential and Proprietary - do not copy or distribute
3. Disclaimer
• All of the materials and information presented today are proprietary to Syncsort and are confidential in nature.
• This presentation does not constitute a commitment on Syncsort's part to deliver the functionality referenced or stated. Product release dates and/or capabilities referenced in this document may change at any time at Syncsort's sole discretion.
4. Data Liberation, Integrity & Integration for Next-Generation Analytics
Trusted Industry Leadership
We provide unique data management solutions and expertise to over 2,500 large enterprises worldwide, with an unmatched focus on customer success & value. Marquee global customer base of leaders and emerging businesses across all major industries.
Best Quality, Top Performance, Lower Costs
Our proven software efficiently delivers all critical enterprise data assets with the highest integrity to Big Data environments, on premise or in the cloud.
Highly Acclaimed & Award Winning
• Data Quality “Leader” in Gartner Magic Quadrant
• IT World Awards® 2016 “Innovations in IT” Gold Winner
• Database Trends & Applications “Companies That Matter Most in Data”
Data Access & Transformation
• Mainframe Access & Integration for Application Data
• Mainframe Access & Integration for Machine Data
• High-Performance ETL
Data Infrastructure Optimization
• Enterprise Data Warehouse Optimization
• Application Modernization
• Mainframe Optimization
Data Quality
• Big Data Quality & Integration
• Data Enrichment & Validation
• Data Governance
• Customer 360
6. Benefits
• Connect to virtually any data source, including mainframe and MPP databases.
• Move data into and out of Hadoop up to 6x faster without the need for manual scripts.
• Develop ETL processes without writing code.
• Seamlessly accelerate Hadoop performance and scalability for ETL operations in both MapReduce and Spark.
Syncsort: High-Performance Import from Existing Databases
7. Syncsort + Hortonworks Advantages
Technical Benefits
• Apache Ambari Integration
• Deploy DMX-h across cluster
• Monitor DMX-h jobs
• Process in MapReduce or Spark
• Source relational and non-relational data (including mainframes)
• Out-of-the-box integration, interoperability & certifications
• Kerberos-secured clusters
• Apache Sentry/Ranger security certified
• Early beta, release certification
• Metadata lineage export from DMX
• Atlas integration
8. WHAT’S NEW IN DMX/DMX-H
9. Access: Bring ALL Enterprise Data Securely to the Data Lake
Database
– RDBMS
– MPP
– NoSQL
Mainframe
– DB2/z
– VSAM
– FTP Binary
– Mainframe Fixed
– Mainframe Variable
– Mainframe Distributable
– COBOL IT line sequential
– All file formats…
Big Data
– JSON
– Avro
– Parquet
– ORC
– Hive (Enhancements)
Streaming
– Kafka
– MapR Streams
– HDF (NiFi)
Cloud
– Amazon S3
– Amazon Redshift, RDS
– Google Cloud Storage
… And more!
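For a sense of what mainframe access involves under the hood, here is a minimal Python sketch of decoding a fixed-width mainframe record from EBCDIC (code page 037). DMX performs this conversion natively; the field layout and sample values below are hypothetical, for illustration only.

```python
import codecs

# A hypothetical 20-byte fixed-width record: a 6-character account ID
# followed by a 14-character, space-padded name, encoded in EBCDIC (cp037).
record_bytes = "100042JANE DOE      ".encode("cp037")

# Decode the EBCDIC bytes back to text and slice out the fields.
text = codecs.decode(record_bytes, "cp037")
record = {"account_id": text[:6], "name": text[6:].rstrip()}
print(record)  # {'account_id': '100042', 'name': 'JANE DOE'}
```

Real mainframe files add packed-decimal fields, variable-length records and copybook-driven layouts, which is why DMX reads COBOL copybooks directly rather than relying on hand-written slicing like this.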
11. Access: Hive Enhancements
Improvements to Hive support
JDBC connectivity
Support for partitioned tables: ORC, Parquet, AVRO, HDFS
Support for Truncate and Insert
Automatic creation of Hive and other Hcat supported tables
Direct distributed processing of Hive
Update of Hive statistics
Support for Hive tables with complex arrays
12. Combine batch and streaming data sources
Single Interface for Streaming & Batch
Spark 2.x!
Easy development in the GUI; no need to write Scala, C or Java code
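The "single interface for streaming and batch" idea can be sketched in a few lines: one transformation definition consumed unchanged by either a finite batch or an unbounded stream. This is a simplified illustration of the concept, not DMX's actual engine; the record fields are hypothetical.

```python
from typing import Iterable, Iterator

def enrich(record: dict) -> dict:
    # One transformation definition, written once for both modes.
    return {**record, "total": record["qty"] * record["price"]}

def pipeline(source: Iterable[dict]) -> Iterator[dict]:
    # The same pipeline body runs over a finite batch (a list) or an
    # unbounded stream (a generator fed by, say, a Kafka consumer).
    for record in source:
        yield enrich(record)

batch = [{"qty": 2, "price": 5.0}, {"qty": 1, "price": 3.5}]
print(list(pipeline(batch)))
```

The design point is that the job logic never mentions its input's boundedness, so switching from batch to streaming changes only the source connection.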
13. Simplify Streaming Data Integration
14. Comply: Manage
Cloudera Manager
–Deploy DMX-h across Cloudera cluster
–Monitor DMX-h jobs
Apache Ambari
–Deploy DMX-h across Hortonworks and
other clusters
–Monitor DMX-h jobs
Cloudera Director
–Deploy DMX-h on Cloudera in the Cloud
–Elastically expand and reduce capacity as
needed for spikes in workload
15. Comply: Govern
Metadata and data lineage for Hive, Avro and
Parquet through HCatalog
Metadata lineage export from DMX/DMX-h
–Simplify audits, analytics dashboards, metrics
–Integrate with enterprise metadata repositories
–Run-time job metadata and lineage export
Cloudera Navigator certified integration
–Extends HCatalog metadata
–HDFS, YARN, Spark and other metadata
–Lineage, tagging
–Business and structural metadata
Apache Atlas ingestion lineage integration
–Lineage, tagging (Technical preview available now)
–Audit and track
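The run-time lineage export above can be pictured as emitting one record per job step and handing the result to an enterprise metadata repository. The record shape below is a hypothetical illustration; the real DMX export format and the Atlas/Navigator integration details differ.

```python
import json

# Hypothetical lineage records, one per job step: where data came from,
# what touched it, and where it landed.
lineage = [
    {"step": "ingest",  "source": "db2://SALES.ORDERS",
     "target": "hdfs:/landing/orders.avro"},
    {"step": "cleanse", "source": "hdfs:/landing/orders.avro",
     "target": "hive://sales.orders"},
]

# A metadata repository could ingest this as JSON for audits and dashboards.
export = json.dumps(lineage, indent=2)
print(export)
```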
16. Extend User Base with Data Transformation Language (DTL)
• Metadata-driven dynamic creation of DMX-h jobs
• Enables partners and end users to build on and extend DMX
• Human-readable, script-like interface for developing jobs
• Legacy ETL migrations to DMX
– Ability to import DTL to the DMX Graphical User Interface
– Maintain applications in the GUI
– Export metadata to DTL
17. Same Solution – On Premise or In the Cloud
• ETL engine on AWS Marketplace – Update to version 9.x
• Available on EC2, EMR, Google Cloud
• S3 and Redshift connectivity
• Google Cloud Storage connectivity
• First & only leading ETL engine on Docker Hub
Big Data + Cloud + Syncsort = Powerful, Flexible, Cost Effective
18. Design Once, Deploy Anywhere
Intelligent Execution Layer
One interface to design jobs to run on:
Single Node, Cluster
MapReduce, Spark, Spark 2.x!
Windows, Unix, Linux
On-Premise, Cloud
Batch, Streaming
• Use existing ETL skills.
• No worries about mappers, reducers, big side, small side, and so on.
• Automatic optimization for best performance, load balancing, etc.
• No changes or tuning required, even if you change execution frameworks
• Future-proof job designs for emerging compute frameworks, e.g. Spark
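The "design once, deploy anywhere" claim amounts to separating job logic from the execution framework. A minimal sketch of that separation, with hypothetical names (a real engine would translate the job into MapReduce or Spark plans rather than share one implementation):

```python
from typing import Callable, Iterable, List

def job(records: Iterable[dict]) -> List[int]:
    # Framework-neutral job logic: filter, then project a column.
    return [r["id"] for r in records if r["amount"] > 100]

def execute(job_fn: Callable, records, framework: str):
    # The execution layer chooses the framework at run time; the job
    # definition is untouched. Here all frameworks share one single-node
    # implementation purely for illustration.
    assert framework in {"single-node", "mapreduce", "spark"}
    return job_fn(records)

data = [{"id": 1, "amount": 250}, {"id": 2, "amount": 40}]
print(execute(job, data, "spark"))  # [1]
```

Because `job` never references a framework, moving it from MapReduce to Spark (or to a future engine) is a deployment choice, not a rewrite.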
Intelligent Execution – Big Data technology changes fast. Syncsort lets you change with it.
19. Design One Job, Deploy Each Step Anywhere
Integrated Workflow
In a single job, combine any execution location, framework or style:
• Ingest data on an edge node, then process on the cluster in a single workflow
• Combine MapReduce ETL with Spark data analysis
• Run extended tasks and custom functions in the framework of your choice
23. Best-of-Breed Data Quality & Integration: A Winning Combination
“Existing customers and prospects can view this acquisition as
positive. It extends Syncsort's information management capabilities
through strengthened data quality and data governance
functionality for the use cases they encounter.”
– “Syncsort Accelerates Data Quality With Trillium Acquisition Deal,” Gartner, December 6, 2016
24. First, we configure DMX to access and ingest data from a JSON source. Second, DMX ingests data from a mainframe in EBCDIC format. Finally, DMX ingests data from an XML source. All of these source files have different field structures.
DMX then merges these files into one consistent format. At the same stage, DMX produces two exports:
• one simple text/csv output
• a first write to a Hive database
DMX then invokes TSS to perform the Data Quality processing.
Once DQ is complete, DMX takes back over and performs a join to a 3rd-party (e.g. tag, match, suppression) file.
DMX then takes the final output and produces 4 outputs:
• a simple txt/csv file
• an optimized Tableau file
• a QlikView file
• a further write to a Hive database
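The merge step above takes sources with different field structures and normalizes them into one consistent record format. A minimal Python sketch of that idea, with hypothetical field names and sample data (DMX does this in the GUI, without hand-written code):

```python
import json
import xml.etree.ElementTree as ET

# Two sources with different field structures for the same entity.
json_src = '{"cust_id": "C1", "full_name": "Jane Doe"}'
xml_src = "<customer><id>C2</id><name>John Roe</name></customer>"

def from_json(text: str) -> dict:
    # Map the JSON layout onto the common record format.
    d = json.loads(text)
    return {"id": d["cust_id"], "name": d["full_name"]}

def from_xml(text: str) -> dict:
    # Map the XML layout onto the same common format.
    root = ET.fromstring(text)
    return {"id": root.findtext("id"), "name": root.findtext("name")}

merged = [from_json(json_src), from_xml(xml_src)]
print(merged)
```

Once every source is mapped to the common format, the downstream steps (the Hive write, the DQ processing, the third-party join) only ever see one schema.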
27. Get Your Database Data into Hadoop at the Press of a Button
• Funnel hundreds of tables at once into your data lake
‒ Extract, map and move whole DB schemas in one invocation
‒ Extract from Oracle, DB2/z, MS SQL Server, Teradata and Netezza
‒ To SQL Server, Postgres, Hive, and HDFS
‒ Automatically create target Hive and HCat tables
• Process multiple funnels in parallel on edge node or data nodes
‒ Order data flows by dependencies
‒ Leverage DMX-h high performance data processing engine
• Extract only the data you want
‒ Data type filtering
‒ Table, record or column exclusion / inclusion
• In-flight transformations and cleansing
DMX DataFunnel™ – Move thousands of tables in days, not weeks!
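The funnel pattern above (select tables by inclusion/exclusion rules, then copy them in parallel) can be sketched briefly. The table names and the copy step are hypothetical stand-ins for real source and target connections; DataFunnel additionally orders flows by dependencies and runs on the DMX-h engine.

```python
from concurrent.futures import ThreadPoolExecutor

tables = ["ORDERS", "CUSTOMERS", "AUDIT_LOG", "PRODUCTS"]
exclude = {"AUDIT_LOG"}  # table-level exclusion rule

def funnel(table: str) -> str:
    # Placeholder for extract -> map -> load of a single table.
    return f"{table}: copied"

# Apply the filter, then process the remaining funnels in parallel.
selected = [t for t in tables if t not in exclude]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(funnel, selected))
print(results)
```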
28. New User Experience for DataFunnel
30. New UI Wizard Flow Creation
31. DMX CHANGE DATA CAPTURE
32. DMX Change Data Capture Bridges Mainframe Data and Hadoop
Keeps Hadoop data in sync with mainframe changes in real time
• without overloading networks
• without incurring a high MIPS cost
• without affecting source database performance
• without coding or tuning.
Dependable – Reliable transfer of data even during loss of mainframe connection or Hadoop cluster failure. Continues from the failure point.
Fast – Both Hive data and table statistics updated in real time.
Flexible – Works with all Hive tables, including those backed by text, ORC, Parquet or Avro.
[Diagram: DB2 → DMX Change Data Capture → Hive]
33. DMX Change Data Capture Architecture
1. Capture: The DMX CDC engine scrapes the DB2 logs and stores only the delta, the data that has changed, flagging each record as Updated, Deleted or Inserted. Virtually no MIPS usage.
2. On an edge node in DMX-h, a CDC Reader consumes a single raw data stream of the delta data and splits it into parallel load streams for the cluster.
3. Apply: DMX-h applies the changes to Hive tables and updates Hive statistics to facilitate queries on the new data.
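The apply step can be pictured as replaying a delta stream of Insert/Update/Delete flags against a keyed target table. The record shape below is hypothetical; in the real architecture DMX-h applies the deltas to Hive tables and refreshes their statistics.

```python
# Target table keyed by primary key, with one pre-existing row.
target = {1: {"id": 1, "name": "old"}}

# A delta stream as captured from the DB2 logs: each change carries an
# operation flag (I = Inserted, U = Updated, D = Deleted) and the key.
delta = [
    {"op": "I", "id": 2, "name": "new row"},
    {"op": "U", "id": 1, "name": "updated"},
    {"op": "D", "id": 2},
]

for change in delta:
    key = change["id"]
    if change["op"] == "D":
        target.pop(key, None)          # remove deleted rows
    else:
        # Inserts and updates both upsert the latest image of the row.
        target[key] = {"id": key, "name": change["name"]}

print(target)  # {1: {'id': 1, 'name': 'updated'}}
```

Replaying only the delta, rather than reloading whole tables, is what keeps network traffic and source-side load low.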
36. What Next?
Find out more about DMX Change Data Capture
http://www.syncsort.com/en/Products/BigData/DMX-Change-Data-Capture
Talk to your account manager for a customized demo & to see how our latest features can help you! http://www.syncsort.com/en/ContactSales