Change Data Capture (CDC) is the practice of moving the changes made in an important transactional system to other systems, so that data is kept current and consistent across the enterprise. CDC keeps reporting and analytic systems working on the latest, most accurate data.
Many different CDC strategies exist, each with its own advantages and disadvantages. Some put an undue burden on the source database, causing queries or applications to slow down or even fail. Others consume excessive network bandwidth, or introduce long delays between a change and its replication.
Each business process has different requirements, as well. For some business needs, a replication delay of more than a second is too long. For others, a delay of less than 24 hours is excellent.
Which CDC strategy will match your business needs? How do you choose?
View this webcast on-demand to learn:
• Advantages and disadvantages of different CDC methods
• The replication latency your project requires
• How to keep data current in Big Data technologies like Hadoop
Which Change Data Capture Strategy is Right for You?
1. Which Change Data Capture
Strategy Is Right for You?
Presented by
Paige Roberts
Sr. Product Marketing Manager
Data Integration, Data Quality
2. Choosing a Change Data Capture Strategy
1 What is Change Data Capture?
2 Why Do Change Data Capture?
3 Strategies for Change Data Capture
4 Examples of Change Data Capture
5 Q and A
3. CDC is the process that ensures that changes made over time in
one dataset are automatically transferred to another dataset.
Change Data Capture or CDC is most often used with databases that hold important
transactional data to make sure that organizations are working with up-to-date information
across the enterprise.
Source - often used to record transactions or other business occurrences as they happen.
Target - often used to create a report or do analysis to determine a course of action.
Sometimes, data is replicated bi-directionally so that a source is also a target and vice versa.
What is Change Data Capture?
4. Replication Options
One Way
Two Way
Cascade
Bi-Directional
Distribute
Consolidate
Choose a topology or combine them to meet your data sharing needs.
5. Integrated Architecture Use Case
An ERP system (customer orders, payment details, product catalogue, price list) shares data with:
• eCommerce & web portals
• Test & audit environment
• Data exchange with an outside vendor (flat file)
• DR / backup
8. 1. Businesses have Multiple Databases
Multiple databases are the norm
• Merger or acquisition
• Choice of multiple apps or databases for best of breed solutions
• Combination of legacy and new databases
• Multi-organization supply chain
IT infrastructures are heterogeneous
• Database platforms
• Operating systems
• Hardware
Drivers Behind Change Data Capture
Does your organization rely on multiple databases?
• Yes: 83%  • No: 10%  • I don't know: 8%
Does your organization share data between multiple databases?
• 73% of those with multiple databases share data among them
Source: Vision Solutions' 2017 State of Resilience Report
9. 2. Enabling Analytics, Reporting and BI
• Protecting performance of production
database by offloading data to a reporting
system for queries, reports, business
intelligence or analytics
• Consolidating data into centralized
databases, data marts or data warehouses
for decision making or business processing
Drivers Behind Change Data Capture
10. 3. Enabling Machine Learning, Advanced Analytics and AI
• Growing data volumes lead to new architectures for
data consolidation – data lakes and enterprise data hubs
based on Hadoop or Spark.
• New types of data and larger amounts of data from
multiple sources combined together create an ideal
environment for training and employing machine
learning and artificial intelligence.
• Businesses across many industries seek competitive
edge from these new technologies in use cases from
fraud detection to targeted marketing.
• ML and AI systems have a constant, voracious need for
more data, and must constantly have the latest, most
current data available to provide the promised insights.
Drivers Behind Change Data Capture
11. 4. Varied Business and IT Goals
• Offloading data for maintenance, backup, or testing
on a secondary system without production impact
• Maintaining synchronization between siloed
databases or branch offices
• Feeding segmented data to customer or partner
applications
• Migrating data to new databases
• Re-platforming databases to new database or
operating system platforms
Drivers Behind Change Data Capture
Source: Vision Solutions' 2017 State of Resilience Report
For what business purpose does your organization share data between databases? (most to least common)
• Consolidating data from multiple sources into…
• Reporting on data offloaded from the…
• Synchronizing data between distributed…
• Testing on offloaded data
• Running business processes on offloaded data
• I don't know
12. Why do you need to capture and move the changes in your data?
• Populating centralized databases, data marts, data warehouses, or data lakes
• Enabling machine learning, advanced analytics and AI on modern data architectures like Hadoop and Spark
• Enabling queries, reports, business intelligence or analytics without production impact
• Feeding real-time data to employee, customer or partner applications
• Keeping data from siloed databases in sync
• Reducing the impact of database maintenance, backup or testing
• Re-platforming to new database or operating systems
• Consolidating databases
Goals for Change Data Capture
14. Timestamps or Version Numbers
Advantages
• Simple
• Nearly every database can find changed rows with a WHERE clause on the timestamp or version column.
Disadvantages
• Must be built into the database schema
• Bloats database size
• The change query requires considerable compute resources in the source database
• Not always reliable
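As an illustration, the timestamp approach reduces to remembering a watermark and querying for rows past it. Here is a minimal Python sketch using SQLite; the `orders` table and its `last_updated` column are hypothetical, not from the presentation:

```python
import sqlite3

# Hypothetical source table with a last_updated timestamp column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_updated TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2024-01-01T09:00:00"),
    (2, 25.5, "2024-01-02T14:30:00"),
    (3, 40.0, "2024-01-03T08:15:00"),
])

def capture_changes(conn, watermark):
    """Return rows changed since the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, amount, last_updated FROM orders WHERE last_updated > ?",
        (watermark,),
    ).fetchall()
    new_watermark = max([r[2] for r in rows], default=watermark)
    return rows, new_watermark

changed, wm = capture_changes(conn, "2024-01-02T00:00:00")
print(changed)  # rows 2 and 3 only
print(wm)       # 2024-01-03T08:15:00
```

The sketch also shows where the unreliability comes in: any row whose timestamp isn't maintained correctly, and any delete, is simply never seen by the WHERE clause.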
15. Table Triggers
Advantages
• Very reliable and detailed
• Changes can be captured almost as fast as they are made – real-time CDC.
Disadvantages
• Significant drag on database resources, both
compute and storage.
• Requires that the database have the capability.
• Negative impact on performance of applications that
depend on the source database.
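Trigger-based CDC can be sketched in SQLite (the `customers` table and shadow-table naming here are invented for illustration): AFTER INSERT/UPDATE/DELETE triggers copy each change into a shadow table that a replication process later drains. Every application write now performs an extra insert, which is exactly the compute and storage drag noted above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
-- Shadow table the triggers write into; a CDC process drains it.
CREATE TABLE customers_changes (
    op TEXT, id INTEGER, name TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER customers_ins AFTER INSERT ON customers
BEGIN
    INSERT INTO customers_changes (op, id, name) VALUES ('I', NEW.id, NEW.name);
END;
CREATE TRIGGER customers_upd AFTER UPDATE ON customers
BEGIN
    INSERT INTO customers_changes (op, id, name) VALUES ('U', NEW.id, NEW.name);
END;
CREATE TRIGGER customers_del AFTER DELETE ON customers
BEGIN
    INSERT INTO customers_changes (op, id, name) VALUES ('D', OLD.id, OLD.name);
END;
""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("UPDATE customers SET name = 'Ada L.' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")

print(conn.execute("SELECT op, id, name FROM customers_changes").fetchall())
# [('I', 1, 'Ada'), ('U', 1, 'Ada L.'), ('D', 1, 'Ada L.')]
```

Because the triggers fire inside each transaction, the change record is detailed and reliable – and available the instant the change commits.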
16. Snapshot or Table Comparison
Advantages
• Relatively easy to implement with
good ETL software.
• Requires no specialized knowledge
of the source database.
• Very dependable and accurate.
Disadvantages
• Requires repeatedly moving all data in monitored tables, which may impact target or staging system resources and network bandwidth.
• Moving lots of data can be slow and may not meet SLAs.
• Joining, comparing, and finding changes also takes time, making it even slower.
• Not a complete record of intermediate changes between snapshot captures.
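A minimal sketch of the comparison step itself (the keyed snapshot data is invented): once both snapshots are loaded, set operations on primary keys classify every row as an insert, update, or delete – and, as noted above, a row that changed twice between captures shows up only once:

```python
def diff_snapshots(old, new):
    """Compare two keyed snapshots and classify each change.

    old/new map primary key -> row tuple. Returns inserts, updates, deletes.
    Intermediate changes between the two snapshots are invisible by design.
    """
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}
    return inserts, updates, deletes

yesterday = {1: ("Ada", 100), 2: ("Grace", 200), 3: ("Edsger", 300)}
today     = {1: ("Ada", 150), 3: ("Edsger", 300), 4: ("Barbara", 50)}

ins, upd, dele = diff_snapshots(yesterday, today)
print(ins)   # {4: ('Barbara', 50)}
print(upd)   # {1: ('Ada', 150)}
print(dele)  # {2: ('Grace', 200)}
```

In practice an ETL tool performs this as a high-speed sort and join over the full tables, which is where the bandwidth and runtime costs come from.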
17. Log Scraping
Advantages
• Very reliable and detailed.
• Virtually no impact on database or application
performance.
• Changes captured in real-time.
• No database bloat.
Disadvantages
• Every RDBMS has a different log format, often not documented.
• Log formats often change between RDBMS versions.
• Log files are frequently archived by the database. CDC software must read them before they're archived, or be able to read the archived logs.
• Requires specialized CDC software; cannot be easily accomplished with ETL software.
• Can fail if connectivity is lost on source or target, causing lost data, duplicated data, or the need to restart from the initial data load.
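Conceptually, a log scraper tails the database's change log and tracks the last position it has applied, so a restart after lost connectivity neither drops nor repeats events. This Python sketch uses an invented, simplified log format; real RDBMS logs are binary and largely undocumented, which is exactly why specialized CDC software is needed:

```python
# A toy, made-up change log: (position, operation, table, row data).
# Real transaction logs are proprietary binary formats.
LOG = [
    (101, "INSERT", "orders", {"id": 1, "amount": 10.0}),
    (102, "UPDATE", "orders", {"id": 1, "amount": 12.5}),
    (103, "DELETE", "orders", {"id": 1}),
]

def scrape(log, last_position):
    """Yield change events after last_position.

    The caller persists the position it has applied, so a restart
    resumes exactly where it left off: no lost or duplicated events.
    """
    for position, op, table, row in log:
        if position > last_position:
            yield position, op, table, row

applied = 101  # checkpoint persisted by the target after the last apply
events = list(scrape(LOG, applied))
print(events)  # the UPDATE at 102 and the DELETE at 103 only
```

Note that because the log records every committed operation, deletes and intermediate updates are captured – the two cases the timestamp and snapshot methods miss.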
19. Syncsort DMX & DMX-h:
Simple and Powerful Big Data Integration Software
Syncsort Data Integration and Data Quality for the Cloud
DMX
• GUI for developing MapReduce & Spark jobs
• Test & debug locally in Windows; deploy on Hadoop
• Use-case Accelerators to fast-track development
• Broad based connectivity with automated parallelism
• Simply the best mainframe access and integration with Hadoop
• Improved per node scalability and throughput
High Performance
ETL Software
• Template driven design for:
o High performance ETL
o SQL migration/DB offload
o Mainframe data movement
• Light weight footprint on commodity hardware
• High speed flat file processing
• Self tuning engine
DMX-h
High Performance Hadoop ETL Software
20. DMX Change Data Capture
Keep data in sync in real-time
• Without overloading networks.
• Without affecting source database
performance.
• Without coding or tuning.
Reliable transfer of data you can trust even if connectivity fails on either side.
• Auto restart.
• No data loss.
Real-Time Replication
with Transformation
Conflict Resolution,
Collision Monitoring,
Tracking and Auditing
Files
RDBMS
Streams
Streams
RDBMS
Data
Lake
Mainframe
Cloud
OLAP
21. DMX Change Data Capture Sources and Targets
SOURCES
• IBM Db2/z
• IBM Db2/i
• IBM Db2/LUW
• VSAM
• Kafka
• Oracle
• Oracle RAC (Real Application Clusters)
• MS SQL Server
• IBM Informix
• Sybase
TARGETS
• Kafka
• Amazon Kinesis
• Teradata
• HDFS
• Hive
(HDFS, ORC, Avro, Parquet)
• Impala
(Parquet, Kudu)
• IBM Db2
• SQL Server
• MS Azure SQL
• PostgreSQL
• MySQL
• Oracle
• Oracle RAC
• Sybase
• And more …
Real-Time Replication
with Transformation
Conflict Resolution,
Collision Monitoring,
Tracking and Auditing
Files
RDBMS
Streams
Streams
RDBMS
Data Hub
Mainframe
Cloud
OLAP
22. Design Once, Deploy Anywhere
Intelligent Execution - Insulate your organization from underlying complexities of Hadoop.
Get excellent performance every time
without tuning, load balancing, etc.
No re-design, re-compile, no re-work ever
• Future-proof job designs for emerging compute
frameworks, e.g. Spark 2.x
• Move from development to test to production
• Move from on-premise to Cloud
• Move from one Cloud to another
Use existing ETL and data quality skills
No parallel programming – Java, MapReduce, Spark …
No worries about:
• Mappers, Reducers
• Big side or small side of joins …
Design Once
in visual GUI
Deploy Anywhere!
On-Premise,
Cloud
MapReduce, Spark,
Future Platforms
Windows, Unix,
Linux
Batch,
Streaming
Single Node,
Cluster
23. Snapshot CDC with DMX/DMX-h
• Captures database changes on a
scheduled basis
• High speed sort and join
• Transforms and enhances data
during replication
• Supplies end-to-end lineage of data
for compliance, auditing
• Any source, any target, not limited
to sources with logging
• Fast development in template-
based GUI
• Latency – Usually hourly to weekly
24. Integration in
the Cloud with
DMX ETL
“DMX allows Dickey’s to rapidly
collect, transform and load
thousands of very large files, with
diverse data types from multiple
servers across all of Dickey’s
locations, without performance
bottlenecks.”
Laura Rea, CIO, Dickey's
CHALLENGE
• Modernize antiquated, Excel-based Point of Sale system analytics.
• Must function with minimal on-site infrastructure and support personnel.
• Standardize software across 500+ stores.
• 1000s of large files
• Diverse data types – financial, operations, inventory, purchasing
SOLUTION
• DMX ETL
• AWS cloud-based architecture designed and implemented by iOLAP.
• Rapid job development in visual interface – no hand coding or scripts to maintain.
BENEFITS
• Everyday operations data available to non-technical business users.
• AWS Cloud scales with project needs – Dickey's pays for only what they use.
• Redshift updated every 15-20 minutes for quick, easy, current data-driven business insights.
• Better reporting and analytics = more dollars saved and earned.
25. Log-Based Anything to Hadoop
• Real-time capture
• Minimizes bandwidth usage with LAN/WAN
friendly replication
• Parallel load on cluster
• Updates HDFS, Hive or Impala, backed by HDFS,
Parquet, ORC, or Kudu.
• Updates even versions of Hive that did not
support updating
• Latency – Minutes (less than 5)
Real-Time Replication
with Transformation
Conflict Resolution,
Collision Monitoring,
Tracking and Auditing
Data
Lake
Cloud
Files
RDBMS
Streams
Mainframe
26. Case Study:
Guardian Life Insurance
"We found DMX-h to be very usable and
easy to ramp up in terms of skills. Most
of all, Syncsort has been a very good
partner in terms of support and listening
to our needs."
– Alex Rosenthal, Enterprise Data Office
CHALLENGE
• Enable visualization and BI on broad range of data sets.
• Reduce data preparation, transformation times
• Reduce time-to-market for analytics projects.
• Make data assets available to whole enterprise – including Mainframe.
SOLUTION
• Created Amazon-style data marketplace, supported by data lake,
Hadoop, NoSQL. New projects reuse and build upon existing
data assets. DMX-h adds new data to the Data Lake with
each new project.
• DMX DataFunnel quickly ingested hundreds of database
tables at push of a button
• DMX Change Data Capture pushes changes from Db2 to the
data lake in real time, keeping data current up to the minute.
BENEFITS
• Centralized standardized reusable data assets –
searchable, accessible and managed.
• DMX-h and DataFunnel accelerated
data acquisition, reduced time to
market for analytics and reporting.
27. Anything to Stream, or Stream to Anything
• Real-time capture
• Minimizes bandwidth usage with LAN/WAN
friendly replication
• Parallel load on cluster
• Updates HDFS, Hive or Impala, backed by
HDFS, Parquet, ORC, or Kudu.
• Updates even versions of Hive that did not
support updating
• Latency – real-time; the actual SLA varies depending
on the update speed of the target, stream settings,
etc. Usually seconds.
Real-Time Replication
with Transformation
Conflict Resolution,
Collision Monitoring,
Tracking and Auditing
Files
RDBMS
Streams
Streams
RDBMS
Data
Lake
Mainframe
Cloud
OLAP
28. Case Study:
Global Hotel Data Kept Current On the Cloud
C H A L L E N G E
• More timely collection & reporting on room availability, event bookings, inventory and other hotel data from 4,000+ properties globally
S O L U T I O N
• Near real-time reporting – DMX-h consumes property updates from Kafka every 10 seconds
• DMX-h processes data on HDP, loading to Teradata every 30 minutes
• Deployed on Google Cloud Platform
B E N E F I T S
• Time to Value: DMX-h ease of use drastically cut development time
• Agility: Global reports updated every 30 minutes – previously every 24 hours
• Productivity: Leveraging the ETL team for Hadoop (Spark); visual understanding of the data pipeline
• Insight: Up-to-date data = better business decisions = happier customers
29. Log-Based Database to Database
• Captures database changes as they happen
• Transforms and enhances data during replication
• Minimizes bandwidth usage with LAN/WAN
friendly replication
• Ensures data integrity with conflict resolution
and collision monitoring
• Enables tracking and auditing of transactions for
compliance
• Latency – sub-second
Real-Time Replication
with Transformation
Conflict Resolution,
Collision Monitoring,
Tracking and Auditing
RDBMS
RDBMS
OLAP
30. Centralized Reporting Use Case
Six casinos, each running Db2 on IBM i, feed a single data warehouse database (MS SQL Server on a Windows cluster) via real-time CDC replication with transformation, supporting business intelligence on:
• Customer loyalty
• Amounts paid
• Amounts won
• Time at the table
• Time at the machine
31. Gradual Database Re-Platforming Use Case
America II Corp migrated from an old system (Db2 on IBM i) to a new system (SQL Server on Windows).
• Active-Active replication eliminated the need for a hard cutover and enabled partners to move back and forth between systems
• True zero downtime for migration to new systems
• Transformation between different OS and database platforms with completely different schemas
• 100s of partners moved to the new server after training, at their own pace
32. Syncsort Addresses All Your
Data Sharing Needs
✓ Enables centralization or consolidation of data
✓ Facilitates machine learning, advanced analytics and AI
✓ Facilitates real-time queries, reporting and business intelligence
✓ Transforms data for smooth data flow between databases
✓ Keeps distributed applications and data in sync
✓ Feeds real-time data to mission critical applications
✓ Offloads data for maintenance, testing and backup
✓ Migrates legacy data to new platforms
✓ And more!