As one of the biggest and oldest telecom providers in the Netherlands, with multiple acquisitions over the last few decades, KPN ended up with 1500+ data sources and more than 25 teams working with different tools such as Teradata, Informatica, Oracle, OBIEE and Hadoop. This resulted in a lot of technical debt and duplicated data across systems with complex data relationships, which in turn led to data quality issues and long processing times before the business could get meaningful insights.
This created the need for a unified way of ingesting, storing and transforming data that can be consumed and processed at multiple stages, and that is how the KETL Framework was born.
Our journey so far:
Instead of developing mappings and workflows or handcrafting SQL code, business teams now write metadata about their sources and the dependencies between them, using macro-based Excel files or a Django-based YAML file generator. The KETL framework uses these files to generate the appropriate Hive/Spark/Informatica/Oracle/Teradata code, along with an Airflow scheduler DAG that captures schedule and dependencies as code.
Additionally, all environments, configuration and access rights are managed via YAML files with Ansible, enabling us to treat every change as code. This made teams self-sufficient: they can build their own Dev/Test environment to validate their metadata and target model structure before deploying to production.
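As a rough sketch of this metadata-driven generation (the metadata schema, table names and template below are invented for illustration; KPN's actual format is not shown in the talk):

```python
# Minimal sketch of metadata-driven code generation, in the spirit of KETL.
# A plain dict stands in for the parsed YAML metadata file; the template
# and every field name here are illustrative assumptions.

metadata = {
    "source": "crm_customers",          # hypothetical source table
    "target_schema": "core",            # target zone / schema
    "columns": [
        {"name": "customer_id", "type": "BIGINT"},
        {"name": "full_name",   "type": "STRING"},
        {"name": "updated_at",  "type": "TIMESTAMP"},
    ],
    "partition_by": "load_date",
}

DDL_TEMPLATE = """CREATE TABLE IF NOT EXISTS {schema}.{table} (
{column_defs}
)
PARTITIONED BY ({partition} STRING)
STORED AS ORC;"""

def generate_hive_ddl(meta):
    """Render a Hive DDL statement from a metadata dict."""
    column_defs = ",\n".join(
        "    {name} {type}".format(**col) for col in meta["columns"]
    )
    return DDL_TEMPLATE.format(
        schema=meta["target_schema"],
        table=meta["source"],
        column_defs=column_defs,
        partition=meta["partition_by"],
    )

print(generate_hive_ddl(metadata))
```

The same metadata can feed other templates (Airflow DAGs, Informatica flows), which is what keeps schedule and dependencies in one reviewable place.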
Benefits:
- Onboarding a new source and integrating it with the existing data store takes less than a week
- Everything is maintained in Git, giving full visibility of changes along with their timelines
- Minimising technical debt and allowing business teams to focus on data instead of tooling
- Easier adoption of newer tools and a path of least resistance for decommissioning the legacy stack
- KISS Architecture, easier to maintain and scale
- Reducing bureaucratic processes and designing for transparency
What keeps us busy:
- Adding a Test Framework to enable users with BDD tests using the same metadata
- Adding functionality to generate complex code structures
- Using advanced CI/CD processes like Jenkins pipelines for faster deployments
- Integration with new tools and technologies, both enterprise and open source
Tools/Technologies used:
- Hortonworks HDP - HDFS, Yarn, Hive, Spark, LLAP, Tez
- User and Access Management - Ranger, Knox, Kerberos, LDAP, SSSD, Linux ACL's
- ETL & DWH Tools - Informatica, Informatica BDM, Teradata, QueryGrid, Aster etc.
- Reporting - OBIEE, Tableau, Zeppelin Notebooks
- Monitoring - Grafana, Zabbix & ELK
- Scheduler - Airflow
- Orchestration - Ansible
- Code - Python, YAML, Jinja2
- CI/CD - Git, Artifactory, Jenkins
Speaker
Gerhard Messelink, Teacher, KPN
KPN ETL Factory (KETL) - Automated Code generation using Metadata to build Data Models and Datamarts
1. KPN ETL Framework
Automated Code generation using Metadata to build Data Processes
KETL
Gerhard Messelink
19-04-18 KPN ETL Framework
2. KPN Transformation
Change on all fronts
• Organization
o Agile teams / DevOps
o Self-steering
• Platform
o Teams in control
o ‘Everything’ As A Service (infrastructure, platforms, applications)
• Tools
o Automate all processes (provisioning, delivery pipeline, testing, development)
3. KETL Framework
• Patterns
o Hadoop
o BRE
o Data model
• Specify metadata about the data processes and models
• Maintain standards in templates
4. Data Lake Structure: Landing Zone
• Temporary File landing Area
• CrushFTP file delivery
• Automated Creation
• Off-site backup in the DR-NAS
[Diagram: Landing Zone → Raw Zone → Core Zone → Mirror Zone → DataMart Zone]
5. Data Lake Structure: Raw
• Main storage of Raw Information
• No reprocessing whatsoever
• Data is never deleted unless it has expired
• Schema is applied here based on metadata
• Automated Creation
• Less efficient, but highly scalable (HDFS)
• Allows us to rebuild all next zones at any time if required
6. Data Lake Structure: Core
• Automated Creation
• Structured representation of the RAW
• No Business Logic applied – data as in the RAW
• Partitioned, compressed storage, Optimized for Read Performance
• DQ Checks Passed
• Data Governance, Lineage, Auditing
7. Data Lake Structure: Mirror
• Latest State of the source
• Based on User Defined Business Keys
• Full Dump, Delta, Transactional, etc. methods
• Automated Creation
• Recreated on each build
• Historical Mirrors kept for a period of time
• Data Governance, Lineage, Auditing
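The Mirror-zone idea above can be sketched in a few lines: keep only the latest state of each record, identified by a user-defined business key. All field names below are invented for illustration; the real framework generates equivalent logic on Hive/Spark.

```python
# Sketch of "latest state per business key" - the core of a Mirror build.
# Field names (customer_id, plan, load_ts) are hypothetical examples.

def build_mirror(rows, business_key, order_by):
    """Return the latest row per business key, ordered by a load timestamp."""
    latest = {}
    for row in rows:
        key = tuple(row[k] for k in business_key)
        # A later load (full dump, delta or transaction) replaces earlier state.
        if key not in latest or row[order_by] > latest[key][order_by]:
            latest[key] = row
    return sorted(latest.values(), key=lambda r: tuple(r[k] for k in business_key))

# A full dump followed by a delta: the delta overwrites the earlier state.
rows = [
    {"customer_id": 1, "plan": "basic",   "load_ts": "2018-04-01"},
    {"customer_id": 2, "plan": "premium", "load_ts": "2018-04-01"},
    {"customer_id": 1, "plan": "premium", "load_ts": "2018-04-15"},  # delta
]
mirror = build_mirror(rows, business_key=["customer_id"], order_by="load_ts")
print(mirror)  # customer 1 now carries the later "premium" plan
```

Because the Mirror is recreated on each build from the Core zone, historical mirrors can simply be retained copies of earlier builds.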
8. Data Lake Structure: DataMart
• Automated Creation
• Basic Combinations of Core or Mirror Entities
• Basic Filter and Aggregation functionality
• Basic Business Logic
9. Integration Layer: BRE
• Data Transformation
• Metadata Lineage
• Data Surveillance / PMPC
• Erwin Integration
• Data Profiling
• Scheduling
[Diagram: Mirror Zone → LSM → TPL → CLDM]
10. Integration Layer: LSM
Transformation of source model to target logical model using:
• Filters
• Aggregations
• Sorting
• Case statements
• Derived Calculations
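A transformation like this can itself be specified as metadata and rendered into SQL. The sketch below shows the idea; the spec keys and table names are assumptions mirroring the capabilities listed above, not KPN's actual format.

```python
# Illustrative sketch: rendering an LSM-style transformation spec into SQL.
# Spec keys (filters, group_by, select) and all names are invented examples.

spec = {
    "source": "mirror.subscriptions",
    "target": "lsm.active_subs_per_region",
    "filters": ["status = 'ACTIVE'"],
    "group_by": ["region"],
    "select": [
        "region",
        "COUNT(*) AS active_subs",
        "CASE WHEN COUNT(*) > 1000 THEN 'large' ELSE 'small' END AS segment",
    ],
}

def render_sql(spec):
    """Build an INSERT ... SELECT statement from a transformation spec."""
    sql = "INSERT OVERWRITE TABLE {target}\nSELECT {cols}\nFROM {source}".format(
        target=spec["target"],
        cols=", ".join(spec["select"]),
        source=spec["source"],
    )
    if spec.get("filters"):
        sql += "\nWHERE " + " AND ".join(spec["filters"])
    if spec.get("group_by"):
        sql += "\nGROUP BY " + ", ".join(spec["group_by"])
    return sql + ";"

print(render_sql(spec))
```

The same spec could equally be rendered into an Informatica mapping or Spark job by swapping the template, which is what makes the metadata the single source of truth.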
12. Integration Layer: CLDM
Logical Data Model:
• Delta Management
• History Management
• Load Management
13. KETL Platform Overview (Animated)
[Architecture diagram (HOD-IL and HOD-DL clusters): Core KETL CI/CD tooling (Metadata Collection, Version Management, Build Code, Automated Deployment through D->T->A->P) drives the acquisition flow (Source → Landing Zone → RAW → CORE → Mirror) and the integration flow (LSM → TPL → cLDM). Data movement and transformation run on Informatica PowerCenter and BDM, data flow execution is scheduled with Airflow, and a Governance & Monitoring layer (Metadata Manager, ProActive Monitoring, Job Control Framework) sits alongside preparation tooling (Business Glossary, Analyst, Reference 360, DQ), each on dedicated hardware.]
Generated from the metadata:
o SQL & HQL DDLs
o Directory Structures
o Hadoop Code
o Informatica Flows
o Schedules (DAGs)
o Python scripts for DQ
o Transfer code
o Lineage
o Monitoring rules
o Logging
o Profiles
14. KETL Platform Overview (Animated 2)
[Same architecture diagram as the previous slide, now highlighting the Governance & Monitoring layer (Metadata Manager, ProActive Monitoring, Job Control Framework) and the preparation tooling (Business Glossary, Analyst, Reference 360, DQ), together with installation automation and version-managed configuration.]
15. Use metadata specifications
• Write specifications so that:
o No translation is needed
o They can be executed directly
o No interpretation is required
o There is less risk of misunderstanding
• The format should be:
o Easy to read
o Easy to understand ---> Review-ability
o Easy to discuss
o Easy to parse ---> by a computer
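"Easy to parse by a computer" is what makes the specs executable. A minimal sketch of that idea: validate a metadata dict before generating anything from it (the required keys below are illustrative assumptions, not KPN's schema).

```python
# Sketch of machine-parseable specs: a tiny validator run before code
# generation. REQUIRED_KEYS and all field names are invented examples.

REQUIRED_KEYS = {"source", "target_schema", "columns"}

def validate_spec(spec):
    """Return a list of problems; an empty list means the spec is usable."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - spec.keys())]
    for col in spec.get("columns", []):
        if "name" not in col or "type" not in col:
            problems.append(f"column needs 'name' and 'type': {col}")
    return problems

good = {"source": "crm", "target_schema": "core",
        "columns": [{"name": "id", "type": "BIGINT"}]}
bad = {"source": "crm", "columns": [{"name": "id"}]}

print(validate_spec(good))  # []
print(validate_spec(bad))   # two problems reported
```

Because the same file is read by reviewers and by the generator, a spec that passes review and passes validation is, by construction, the thing that runs.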
16. Using Version Control for self-service
• Contribute to the framework in an `open source` style
• Git repositories
o KETL ansible repository
o KETL template repository
o KETL metadata repository
o KETL robot test repository
• Using Pull Requests
• Merge to specific branches will trigger the build and deploy process
17. Reproducible infrastructure
• Saltstack repository
o Managing packages
o Managing access
• Ansible repository
o Manage applications
o Manage configuration
18. KETL Roadmap
• Security & Privacy & Compliance enhancements
o Automate Ranger rules
o Row level tagging
• Extending the framework
o Datamart transformation model
o Improve user interface