American Water will share the success story of its production use case: leveraging Hadoop and streaming to ingest and supply de-normalized data from source transactional systems to end-user applications. The talk covers the end-to-end flow and the challenges faced.
The data is de-normalized into single-subject views at the source to eliminate complex join logic during ingestion into the data lake. Within the views, only timestamps on highly volatile tables are exposed to give visibility into updates and inserts that have occurred on a table. NiFi ingests the data with a custom processor and then stores it in ACID tables in Hive. The custom processor polls the timestamp columns and generates paginated queries that contain only the delta.
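A minimal sketch of the timestamp-polling approach described above. The function, view name, and query shape are illustrative assumptions (the actual NiFi processor's internals are not shown in the talk); it builds a set of paginated delta queries from a watermark timestamp, using the LIMIT/OFFSET pagination that HANA SQL supports:

```python
def build_delta_queries(view_name, ts_column, last_watermark,
                        row_count, page_size=50000):
    """Build paginated delta queries for rows changed since the watermark.

    Hypothetical example: names and query shape are assumptions, not the
    processor's actual implementation.
    """
    queries = []
    for offset in range(0, row_count, page_size):
        queries.append(
            f"SELECT * FROM {view_name} "
            f"WHERE {ts_column} > '{last_watermark}' "
            f"ORDER BY {ts_column} "
            f"LIMIT {page_size} OFFSET {offset}"
        )
    return queries

# Example: 120,000 changed rows split into pages of 50,000.
qs = build_delta_queries("SVC_ORDER_VIEW", "UPDATE_TS",
                         "2019-06-01 00:00:00", row_count=120000)
```

Each page can then be fetched and landed into the Hive ACID tables independently, keeping individual queries against the source small.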
American Water’s use case: our field employees are our front line with customers, and in the past they have felt unable to help customers effectively with the technologies available to them. One of our largest initiatives is to equip field employees with accurate, up-to-date information via a new application so they can provide a great customer experience.
Speaker
John Kuchmek, American Water, Senior Technologist
Adam Michalsky, American Water, Senior Technologist
2. WHO WE ARE
We serve a broad national footprint and maintain a strong local presence.
We provide services to approximately 15 million people in 46 states and Ontario, Canada.
We employ 6,900 dedicated and active employees and support ongoing community and corporate responsibility initiatives.
We treat and deliver more than one billion gallons of water daily.
We are the largest and most geographically diverse publicly traded water and wastewater service provider in the United States.
3. Problem Statement
Achieve fast change data capture from SAP while providing de-normalized data sets to end consumers, without impacting the source transactional systems.
HANA table replication maintains the source system's normalization, which complicates business logic design for application use.
No HANA change data capture existed that used denormalized table structures.
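To make the de-normalization idea concrete, here is a hypothetical single-subject view sketch: all table and column names are invented for illustration. The view joins several normalized source tables into one flat record and exposes only the timestamp from the highly volatile table, which is what downstream CDC polling keys off:

```python
# Illustrative only: a denormalized single-subject view over hypothetical
# normalized tables, exposing the volatile table's UPDATE_TS for CDC.
SERVICE_ORDER_VIEW = """
CREATE VIEW SVC_ORDER_VIEW AS
SELECT o.ORDER_ID,
       c.CUSTOMER_NAME,
       a.STREET,
       a.CITY,
       o.UPDATE_TS          -- timestamp from the highly volatile table
FROM   ORDERS o
JOIN   CUSTOMERS c ON c.CUSTOMER_ID = o.CUSTOMER_ID
JOIN   ADDRESSES a ON a.ADDRESS_ID  = c.ADDRESS_ID
"""
```

Because the join logic lives in the view at the source, ingestion only ever selects flat rows, and consumers never re-implement the joins.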
4. Environment
4 Management Nodes (32 cores x 78 GB)
8 Compute Nodes (32 cores x 128 GB)
2 Management Nodes (6 cores x 16 GB)
5 NiFi Nodes (16 cores x 64 GB)
14. Average Memory Used (hourly)
[Chart: average memory used across the 8-node cluster over time, in GB (0-100), plotting the average of minimum, average, and peak memory used.]
The end result in HANA will look like this. UPDATE_TS is our timestamp field.
Special Notes:
A timestamp is only updated once a change occurs; after the initial replication, timestamps will be null or 0.
If you want to add a timestamp to a table that already exists in SLT, the table needs to be re-replicated.
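The null-or-0 behavior above matters for delta logic: rows never touched since initial replication carry no usable timestamp, so they belong to the initial full load rather than any incremental delta. A minimal sketch of that check (function name and types are assumptions, using ISO-format timestamp strings so lexicographic comparison matches chronological order):

```python
def is_changed_since(update_ts, watermark):
    """Return True if a row changed after the watermark.

    Hypothetical helper: rows whose UPDATE_TS is null or 0 have never
    been updated since initial replication, so they are excluded from
    every incremental delta and captured only by the initial full load.
    """
    if update_ts in (None, 0, "0"):
        return False
    # ISO-format strings compare correctly in lexicographic order.
    return update_ts > watermark
```

In practice the same predicate would be pushed into the WHERE clause of the delta query rather than evaluated row by row in the client.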