American Water will share the success story of its production use case: leveraging Hadoop and streaming to ingest and supply de-normalized data from source transactional systems to end-user applications. The talk covers the end-to-end flow and the challenges faced.
The data is de-normalized into single-subject views at the source to eliminate complex join logic during ingestion into the data lake. Within the views, only timestamps on highly volatile tables are exposed to give visibility into updates and inserts that have occurred on a table. NiFi ingests the data with a custom processor and then stores it in ACID tables in Hive. The custom processor polls the timestamp columns and generates paginated queries that contain only the delta.
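A minimal sketch of the timestamp-polling approach described above. The function, view name, and query shape are illustrative assumptions (the actual NiFi processor's internals are not shown in the talk); it builds a set of paginated delta queries from a watermark timestamp, using the LIMIT/OFFSET pagination that HANA SQL supports:

```python
def build_delta_queries(view_name, ts_column, last_watermark,
                        row_count, page_size=50000):
    """Build paginated delta queries for rows changed since the watermark.

    Hypothetical example: names and query shape are assumptions, not the
    processor's actual implementation.
    """
    queries = []
    for offset in range(0, row_count, page_size):
        queries.append(
            f"SELECT * FROM {view_name} "
            f"WHERE {ts_column} > '{last_watermark}' "
            f"ORDER BY {ts_column} "
            f"LIMIT {page_size} OFFSET {offset}"
        )
    return queries

# Example: 120,000 changed rows split into pages of 50,000.
qs = build_delta_queries("SVC_ORDER_VIEW", "UPDATE_TS",
                         "2019-06-01 00:00:00", row_count=120000)
```

Each page can then be fetched and landed into the Hive ACID tables independently, keeping individual queries against the source small.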
American Water’s use case: our field employees are our front line with customers, and in the past they have felt unable to help customers effectively with the technologies available to them. One of our largest initiatives is to equip field employees with accurate, up-to-date information via a new application so they can provide a great customer experience.
Speaker
John Kuchmek, American Water, Senior Technologist
Adam Michalsky, American Water, Senior Technologist
2. WHO WE ARE
We serve a broad national footprint and maintain a strong local presence.
We provide services to approximately 15 million people in 46 states and Ontario, Canada.
We employ 6,900 dedicated and active employees and support ongoing community and corporate responsibility initiatives.
We treat and deliver more than one billion gallons of water daily.
We are the largest and most geographically diverse publicly traded water and wastewater service provider in the United States.
3. Problem Statement
Achieve fast change data capture from SAP while providing de-normalized data sets to end consumers, without impacting the source transactional systems.
HANA table replication maintains the source system's normalization, which complicates business logic design for application use.
No HANA change data capture existed that used denormalized table structures.
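To make the de-normalization idea concrete, here is a hypothetical single-subject view sketch: all table and column names are invented for illustration. The view joins several normalized source tables into one flat record and exposes only the timestamp from the highly volatile table, which is what downstream CDC polling keys off:

```python
# Illustrative only: a denormalized single-subject view over hypothetical
# normalized tables, exposing the volatile table's UPDATE_TS for CDC.
SERVICE_ORDER_VIEW = """
CREATE VIEW SVC_ORDER_VIEW AS
SELECT o.ORDER_ID,
       c.CUSTOMER_NAME,
       a.STREET,
       a.CITY,
       o.UPDATE_TS          -- timestamp from the highly volatile table
FROM   ORDERS o
JOIN   CUSTOMERS c ON c.CUSTOMER_ID = o.CUSTOMER_ID
JOIN   ADDRESSES a ON a.ADDRESS_ID  = c.ADDRESS_ID
"""
```

Because the join logic lives in the view at the source, ingestion only ever selects flat rows, and consumers never re-implement the joins.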
4. Environment
4 Management Nodes (32 cores x 78 GB)
8 Compute Nodes (32 cores x 128 GB)
2 Management Nodes (6 cores x 16 GB)
5 NiFi Nodes (16 cores x 64 GB)
14. Average Memory Used (hourly)
[Chart: average memory used across the 8-node cluster over time, in GB (0-100), plotting the average of minimum, average, and peak memory used.]
The end result in HANA will look like this. UPDATE_TS is our timestamp field.
Special Notes:
A timestamp is only updated once a change occurs; after the initial replication, timestamps will be null or 0.
If you want to add a timestamp to a table that already exists in SLT, the table needs to be re-replicated.
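The null-or-0 behavior above matters for delta logic: rows never touched since initial replication carry no usable timestamp, so they belong to the initial full load rather than any incremental delta. A minimal sketch of that check (function name and types are assumptions, using ISO-format timestamp strings so lexicographic comparison matches chronological order):

```python
def is_changed_since(update_ts, watermark):
    """Return True if a row changed after the watermark.

    Hypothetical helper: rows whose UPDATE_TS is null or 0 have never
    been updated since initial replication, so they are excluded from
    every incremental delta and captured only by the initial full load.
    """
    if update_ts in (None, 0, "0"):
        return False
    # ISO-format strings compare correctly in lexicographic order.
    return update_ts > watermark
```

In practice the same predicate would be pushed into the WHERE clause of the delta query rather than evaluated row by row in the client.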