KPN ETL Framework
Automated Code generation using Metadata to build Data Processes
KETL
Gerhard Messelink
19-04-18 KPN ETL Framework
KPN Transformation
Change on all fronts
• Organization
o Agile teams / DevOps
o Self-steering
• Platform
o Teams in control
o ‘Everything’ as a Service (infrastructure, platforms, applications)
• Tools
o Automate all processes (provisioning, delivery pipeline, testing, development)
KETL Framework
• Patterns
o Hadoop
o BRE
o Data model
• Specify metadata about the data processes and models
• Maintain standards in templates
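The combination of metadata and templates can be illustrated with a minimal sketch. It assumes a hypothetical template and a hypothetical metadata record (the names `DDL_TEMPLATE`, `metadata` and `render_ddl` are illustrative and do not come from KETL): the standard lives in the template, and the metadata supplies the values.

```python
from string import Template

# Hypothetical template holding a standard; KETL's real templates are not shown in the slides.
DDL_TEMPLATE = Template(
    "CREATE EXTERNAL TABLE IF NOT EXISTS ${zone}.${table} (\n"
    "${columns}\n"
    ")\n"
    "STORED AS ${file_format};"
)

# Hypothetical metadata describing one entity.
metadata = {
    "zone": "core",
    "table": "customer",
    "file_format": "ORC",
    "columns": [("customer_id", "BIGINT"), ("name", "STRING")],
}


def render_ddl(meta):
    """Fill the template with values taken from the metadata record."""
    columns = ",\n".join(f"  {name} {dtype}" for name, dtype in meta["columns"])
    return DDL_TEMPLATE.substitute(
        zone=meta["zone"],
        table=meta["table"],
        columns=columns,
        file_format=meta["file_format"],
    )


ddl = render_ddl(metadata)
print(ddl)
```

Changing the standard then means editing one template, while adding an entity means adding one metadata record.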
Data Lake Structure: Landing Zone
• Temporary File landing Area
• CrushFTP file delivery
• Automated Creation
• Off-site backup in the DR-NAS
[Diagram: Landing Zone -> Raw Zone -> Core Zone -> Mirror Zone -> DataMart Zone]
Data Lake Structure: Raw
• Main storage of Raw Information
• No reprocessing whatsoever
• Data is never deleted unless it has expired
• Schema is applied here based on metadata
• Automated Creation
• Less efficient, but highly scalable (HDFS)
• Allows us to rebuild all downstream zones at any time if required
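Applying a schema based on metadata can be sketched as follows. This is only an illustration under stated assumptions: the column list `SCHEMA`, the `;` delimiter and the function `apply_schema` are hypothetical, and the slides do not describe how KETL implements this step.

```python
import csv
import io
from datetime import date

# Hypothetical column metadata as (name, converter) pairs; not taken from KETL.
SCHEMA = [
    ("customer_id", int),
    ("name", str),
    ("signup_date", date.fromisoformat),
]


def apply_schema(raw_text, schema, delimiter=";"):
    """Read delimited raw text and return typed records; the raw text is not modified."""
    reader = csv.reader(io.StringIO(raw_text), delimiter=delimiter)
    records = []
    for row in reader:
        if len(row) != len(schema):
            raise ValueError(f"expected {len(schema)} fields, got {len(row)}: {row}")
        records.append(
            {name: convert(value) for (name, convert), value in zip(schema, row)}
        )
    return records


raw = "1;Alice;2018-04-19\n2;Bob;2018-04-18\n"
records = apply_schema(raw, SCHEMA)
```

Because the raw text stays as delivered, the same metadata can be re-applied later to rebuild the typed records.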
Data Lake Structure: Core
• Automated Creation
• Structured representation of the RAW
• No Business Logic applied – data as in the RAW
• Partitioned, compressed storage, optimized for read performance
• Data quality (DQ) checks passed
• Data Governance, Lineage, Auditing
Data Lake Structure: Mirror
• Latest State of the source
• Based on User Defined Business Keys
• Supports full dump, delta, transactional and other delivery methods
• Automated Creation
• Recreated on each build
• Historical Mirrors kept for a period of time
• Data Governance, Lineage, Auditing
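Deriving the latest state of a source from user-defined business keys can be sketched as below. The function `build_mirror`, the column `load_ts` and the sample rows are assumptions made for illustration; the sketch covers only the case where the newest record per key wins, not deletes in transactional feeds.

```python
def build_mirror(core_rows, business_keys, order_by):
    """Keep, per business key, the row with the highest `order_by` value."""
    latest = {}
    for row in core_rows:
        key = tuple(row[k] for k in business_keys)
        current = latest.get(key)
        # On equal timestamps the later row in the input wins.
        if current is None or row[order_by] >= current[order_by]:
            latest[key] = row
    return list(latest.values())


# Hypothetical Core rows; ISO date strings compare correctly as text.
core_rows = [
    {"customer_id": 1, "city": "Utrecht", "load_ts": "2018-04-17"},
    {"customer_id": 2, "city": "Delft", "load_ts": "2018-04-17"},
    {"customer_id": 1, "city": "Den Haag", "load_ts": "2018-04-18"},
]
mirror = build_mirror(core_rows, business_keys=["customer_id"], order_by="load_ts")
```

Since the function only reads from Core rows, the mirror can be recreated from scratch on every build.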
Data Lake Structure: DataMart
• Automated Creation
• Basic Combinations of Core or Mirror Entities
• Basic Filter and Aggregation functionality
• Basic Business Logic
Integration Layer: BRE
• Data Transformation
• Metadata Lineage
• Data Surveillance / PMPC
• Erwin Integration
• Data Profiling
• Scheduling
[Diagram: Mirror Zone, LSM, TPL, CLDM]
Integration Layer: LSM
Transformation of source model to target logical model using:
• Filters
• Aggregations
• Sorting
• Case statements
• Derived Calculations
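The five transformation types listed above can be shown together in one small sketch. The source rows, the thresholds in `usage_band` and the function names are hypothetical and serve only to make each step visible.

```python
from collections import defaultdict

# Hypothetical source rows.
usage = [
    {"customer_id": 1, "product": "mobile", "mb_used": 1200, "active": True},
    {"customer_id": 1, "product": "mobile", "mb_used": 800, "active": True},
    {"customer_id": 2, "product": "fixed", "mb_used": 50000, "active": True},
    {"customer_id": 3, "product": "mobile", "mb_used": 300, "active": False},
]


def usage_band(total_mb):
    """Case statement: map a total to a label."""
    if total_mb >= 10000:
        return "heavy"
    if total_mb >= 1000:
        return "medium"
    return "light"


def transform(rows):
    # Filter: keep active rows only.
    active = [r for r in rows if r["active"]]
    # Aggregation: sum usage per customer.
    totals = defaultdict(int)
    for r in active:
        totals[r["customer_id"]] += r["mb_used"]
    # Derived calculation (MB to GB) and case statement (usage band).
    result = [
        {"customer_id": cid, "gb_used": mb / 1000, "band": usage_band(mb)}
        for cid, mb in totals.items()
    ]
    # Sorting: highest usage first.
    return sorted(result, key=lambda r: r["gb_used"], reverse=True)


target = transform(usage)
```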
Integration Layer: TPL
Target Publishing Layer, which integrates sources:
• Key Management
• Data Standardization
• Exception Handling
Integration Layer: CLDM
Logical Data Model:
• Delta Management
• History Management
• Load Management
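Delta and history management can be sketched with validity intervals, one common technique for keeping history. The slides do not say which technique KETL uses, so the functions `detect_delta` and `apply_history` and the record layout are assumptions for illustration only.

```python
def detect_delta(previous, incoming):
    """Compare two {business_key: attributes} snapshots."""
    inserts = {k: v for k, v in incoming.items() if k not in previous}
    updates = {
        k: v for k, v in incoming.items() if k in previous and previous[k] != v
    }
    deletes = [k for k in previous if k not in incoming]
    return inserts, updates, deletes


def apply_history(history, inserts, updates, deletes, load_date):
    """Close the open version of changed or deleted keys, then open new versions."""
    changed = set(updates) | set(deletes)
    for version in history:
        if version["key"] in changed and version["valid_to"] is None:
            version["valid_to"] = load_date
    for key, attrs in {**inserts, **updates}.items():
        history.append(
            {"key": key, "attrs": attrs, "valid_from": load_date, "valid_to": None}
        )
    return history


# Hypothetical snapshots and existing history.
previous = {1: {"city": "Utrecht"}, 2: {"city": "Delft"}}
incoming = {1: {"city": "Den Haag"}, 3: {"city": "Breda"}}
history = [
    {"key": 1, "attrs": {"city": "Utrecht"}, "valid_from": "2018-04-17", "valid_to": None},
    {"key": 2, "attrs": {"city": "Delft"}, "valid_from": "2018-04-17", "valid_to": None},
]

inserts, updates, deletes = detect_delta(previous, incoming)
history = apply_history(history, inserts, updates, deletes, "2018-04-18")
```

After the load, key 1 has a closed and an open version, key 2 is closed, and key 3 is open.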
KETL Platform Overview (Animated)
[Diagram; labels recovered from the slide:]
• Core KETL CI/CD Tooling: Metadata Collection, Version Management, Build Code, Automated Deployment (D->T->A->P)
• Preparation: Business Glossary, Analyst, Reference 360, DQ
• Data flow and datastores: HOD-IL, HOD-DL; Source, Landing Zone, RAW, CORE, Mirror, LSM, TPL, cLDM; stages Acquisition and Integration
• Data Movement, Transformation workflows and Governance: PowerCenter, BDM
• Data Flow execution scheduling: Airflow
• Governance & Monitoring: Metadata Manager, ProActive Monitoring, Job Control Framework
• Hardware layers: Data Movement, Transformation workflows; Data flow, Datastores; Data Flow execution scheduling; CICD Tooling; Preparation; Governance & Monitoring
• Artifacts listed on the slide:
o SQL & HQL DDLs
o Directory Structures
o Hadoop Code
o Informatica Flows
o Schedules (DAGs)
o Python scripts for DQ
o Transfer code
o Lineage
o Monitoring rules
o Logging
o Profiles
KETL Platform Overview (Animated 2)
[Second animation frame of the same diagram; it repeats the labels of the first frame and adds:]
• Installation automation
• Version Managed Configuration
Use metadata specification
• Write specifications so
o Translation isn't needed
o It is possible to execute them
o No need for interpretation
o Less risk for misunderstandings
• The format should be:
o Easy to read
o Easy to understand ---> Review-ability
o Easy to discuss
o Easy to parse ---> by a computer
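A specification that is both readable and machine-parseable could look like the sketch below. The slides do not name a format; JSON is used here only because Python's standard library can parse it, and every field name (`source`, `entity`, `load_method`, `business_keys`, `columns`) is an assumption for illustration.

```python
import json

# Hypothetical specification text; the field names are illustrative only.
SPEC_TEXT = """
{
  "source": "crm",
  "entity": "customer",
  "load_method": "delta",
  "business_keys": ["customer_id"],
  "columns": [
    {"name": "customer_id", "type": "BIGINT"},
    {"name": "city", "type": "STRING"}
  ]
}
"""

REQUIRED_FIELDS = {"source", "entity", "load_method", "business_keys", "columns"}
LOAD_METHODS = {"full_dump", "delta", "transactional"}


def load_spec(text):
    """Parse a specification and reject it early if it is incomplete or inconsistent."""
    spec = json.loads(text)
    missing = REQUIRED_FIELDS - set(spec)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if spec["load_method"] not in LOAD_METHODS:
        raise ValueError(f"unknown load method: {spec['load_method']}")
    column_names = {c["name"] for c in spec["columns"]}
    unknown_keys = set(spec["business_keys"]) - column_names
    if unknown_keys:
        raise ValueError(f"business keys not in columns: {sorted(unknown_keys)}")
    return spec


spec = load_spec(SPEC_TEXT)
```

The same text can be reviewed by a person in a pull request and executed by the generator, so no separate translation step is needed.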
Using Version Control for self-service
• Contribute to the framework in an open-source style
• Git repositories
o KETL Ansible repository
o KETL template repository
o KETL metadata repository
o KETL Robot test repository
• Using Pull Requests
• Merging to specific branches triggers the build and deploy process
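The link between a merge and a deployment can be sketched as a simple lookup. The branch names in `BRANCH_TO_ENVIRONMENT` are hypothetical; the slides only state that merges to specific branches trigger build and deploy, and the platform overview shows a D->T->A->P promotion path.

```python
# Hypothetical branch-to-environment mapping; branch names are not from the slides.
BRANCH_TO_ENVIRONMENT = {
    "develop": "D",
    "test": "T",
    "acceptance": "A",
    "master": "P",
}


def on_merge(branch):
    """Return the pipeline steps a merge would trigger, or an empty list for other branches."""
    environment = BRANCH_TO_ENVIRONMENT.get(branch)
    if environment is None:
        return []
    return [f"build generated code from {branch}", f"deploy to {environment}"]


steps = on_merge("develop")
```

Merges to any other branch, such as a feature branch, trigger nothing in this sketch.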
Reproducible infrastructure
• SaltStack repository
o Managing packages
o Managing access
• Ansible repository
o Managing applications
o Managing configuration
KETL Roadmap
• Security, privacy and compliance enhancements
o Automate Ranger rules
o Row-level tagging
• Extending the framework
o DataMart transformation model
o Improve user interface
Questions & Answers

KPN ETL Factory (KETL) - Automated Code generation using Metadata to build Data Models and Datamarts
