This talk will provide a description of several real-life Telco scenarios and our implementations of cloud-based data solutions for them. It will include insights on the pros and cons of the cloud for each use case, and why we chose the specific tools and architectures.
1. Telco Data Pipelines in the Cloud: Architecture and Use Cases
www.croz.net
Miro Miljanić
Data Architect
2. Agenda
• Introduction
• Use Cases:
• Reporting and data migration
• File processing offload
• User Feedback Anonymization
• Conclusion and Q&A
3. What do all Cloud initiatives have in common?
They are all structured as a teen Rom-Com
1. Act One: Setup
- Introduction of the protagonist
- Establish the status quo
- Inciting incident
2. Act Two: Conflict
- Rising action
- Complications
- The false victory
- Emotional low point
3. Act Three: Resolution
- Climax
- Resolution
- Happily ever after
They also involve (over) optimistic assumptions…
5. Reporting and data migration
Act One: Setup
Business problem:
• We want to move reporting and its data to the Cloud:
• A single view of data in the Cloud (operational and analytical)
• Cloud reimplementation of reporting – source and tool replacement
• Initial Governance setup
6. Reporting and data migration
Act One: Setup
What does it really mean:
• (Delta) replication of data from several on-prem DBs into one Cloud DB
• Reimplementation of reporting logic from legacy reports and DB procedures
• Data catalogue, lineage, data access…
• Strict data security rules
7. Reporting and data migration
Act One: Setup
[Architecture diagram: on-premise data and reports; initial setup and delta data replication to the Cloud; business logic reimplementation; governance and security]
8. Reporting and data migration
Act One: Setup
Off we go
• Thorough analysis and Proofs of Concept (POCs)
• Sources – multiple DBs and RDBMS engines, >10k objects
• Target architecture
• Delta load scenarios
• DB code reimplementation scenarios
• Reporting logic reimplementation scenarios and reporting optimizations
• Data maintenance scenarios
• …
10. Reporting and data migration
Act Three: Resolution
What did we do:
• Simpler scenario which was manageable in X days:
• No delta replication of data – initial DDLs, then loops and inserts from views and tables using the data dictionary (sketched after this list)
• No code and report reimplementation
• Setup of all environments, systems and replication
• Security setup
• Development templates
• Cookbooks
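As a rough illustration of the "loops and inserts driven by the data dictionary" approach, here is a minimal Python sketch. It assumes an Oracle-style source exposing ALL_TABLES and ALL_TAB_COLUMNS and a staging schema named cloud_stage on the target; the schema and owner names, the DB-API cursor, and the single-owner filter are illustrative assumptions, not the project code.

# Minimal sketch: enumerate tables from the source data dictionary and emit
# one full-load INSERT ... SELECT per table (no delta logic, as on slide 10).
# Schema/owner names and the DB-API cursor are illustrative assumptions.
def generate_full_load_statements(src_cursor, owner="APP_SCHEMA"):
    src_cursor.execute(
        "SELECT table_name FROM all_tables WHERE owner = :owner",
        {"owner": owner},
    )
    for (table_name,) in src_cursor.fetchall():
        src_cursor.execute(
            "SELECT column_name FROM all_tab_columns "
            "WHERE owner = :owner AND table_name = :tab ORDER BY column_id",
            {"owner": owner, "tab": table_name},
        )
        columns = ", ".join(col for (col,) in src_cursor.fetchall())
        yield (
            f"INSERT INTO cloud_stage.{table_name} ({columns}) "
            f"SELECT {columns} FROM {owner}.{table_name}"
        )

The same loop can be pointed at views by reading the ALL_VIEWS dictionary view instead, which is how the "views and tables" variant from the slide would look.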
11. Reporting and data migration
Act One: Setup
[Architecture diagram, shown again: on-premise data and reports; initial setup and delta data replication to the Cloud; business logic reimplementation; governance and security]
13. File processing offload
Act One: Setup
Business problem:
• File processing offload to Cloud
• We want to reduce our resources - DB, storage, ETL
• We want to be more flexible in terms of scaling
• We want to learn how this differs from our current process (and have the new solution as similar as possible to the previous one)
14. File processing offload
Act One: Setup
What does it really mean:
• Large number of raw files
• Possible DQ and format issues – human intervention needed
• Time constraints for processing
• Performance requirements
• Detailed logging – per file
• Preferred set of tools
15. File processing offload
Act One: Setup
• Off we go
• POC
• Similar loading and processing logic as on-prem: iterative, high logging, multiple-step processing
• Processing in an ETL tool
• Detailed DQ checks per file
16. File processing offload
Act One: Setup
[Process diagram: file landing area → file preprocessing and DQ checks → errors go to an error bucket for verify-and-fix (loop until ready for aggregation) → clean files → aggregation → export and archive]
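To make the flow above concrete, here is a small, hedged Python sketch of the per-file loop: files in the landing area go through DQ checks, failing files are parked in an error bucket together with a per-file log for verify-and-fix, and clean files move on towards aggregation. The directory names and the single example DQ rule are placeholders, not the production checks.

import shutil
from pathlib import Path

# Illustrative sketch of the preprocessing/DQ loop from the diagram.
# Directory names and the delimiter rule are assumptions for the example.
LANDING = Path("landing")
ERROR_BUCKET = Path("error_bucket")
CLEAN = Path("clean")

def dq_issues(path: Path) -> list[str]:
    """Return a list of data-quality problems found in one raw file."""
    issues = []
    for line_no, line in enumerate(path.read_text().splitlines(), start=1):
        if line.count(";") != 4:  # example rule: expect 5 ';'-delimited fields
            issues.append(f"line {line_no}: unexpected field count")
    return issues

def preprocess_landing_area() -> None:
    for f in sorted(LANDING.glob("*.csv")):
        issues = dq_issues(f)
        if issues:
            # park for manual verify-and-fix, with detailed per-file logging
            (ERROR_BUCKET / f"{f.name}.log").write_text("\n".join(issues))
            shutil.move(str(f), ERROR_BUCKET / f.name)
        else:
            shutil.move(str(f), CLEAN / f.name)  # ready for aggregation

Fixed files can simply be dropped back into the landing area, which gives the "loop until ready for aggregation" behaviour from the diagram.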
18. File processing offload
Act Three: Resolution
What did we do:
• Various load, processing and logging scenarios until we found the solution
• Where to process files and how?
• What can be sequential and what can be parallel?
• How to log file processing?
• How to handle and pinpoint errors?
• Database, ETL or Code file processing?
• Database, ETL or Code data logging?
…or a combination?
19. File processing offload
Act Three: Resolution
• Preparation and DQ:
• Iterative, sequential, high logging
• Direct filesystem access
• Error pinpointing
• Orchestration
• Parallel processing of clean files (see the sketch below)
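A minimal sketch of the "parallel processing of clean files" step, assuming the sequential preparation/DQ pass has already produced a clean directory; the worker count, paths and the per-file function are illustrative assumptions.

from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def process_one(path: Path) -> str:
    # stand-in for the real per-file transformation that feeds aggregation
    return f"{path.name}: ok"

def process_clean_files(clean_dir: str = "clean", workers: int = 8) -> list[str]:
    """Run the per-file step in parallel once files have passed DQ."""
    files = sorted(Path(clean_dir).glob("*.csv"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_one, files))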
21. User Feedback Anonymization
Act One: Setup
Business problem:
• User feedback contains free-form entries in which users sometimes include comments containing personal information.
• We want to keep those comments, but we may keep this information for only 90 days; after that we would like to anonymize it.
22. User Feedback Anonymization
Act One: Setup
Original message:
• Hi, my name is Miro Miljanić from CROZ, I contacted your agent yesterday about the problem on my address – Marohnićeva 1, Zagreb. I checked today and it seems that problem still exists. Please contact me on 099887766 or on this e-mail - mmiljanic@croz.net. Thanks in advance.
Anonymized:
• Hi, my name is <Person> from <Organization>, I contacted your agent yesterday about the problem on my address – <Location>, <Location>. I checked today and it seems that problem still exists. Please contact me on <Phone Number> or on this e-mail <e-mail>. Thanks in advance.
Synthetic replacement:
• Hi, my name is James Bond from Microsoft, I contacted your agent yesterday about the problem on my address – 5th Avenue, New York, NY. I checked today and it seems that problem still exists. Please contact me on 01999289029 or on this e-mail dj.example2234@example.com. Thanks in advance.
23. User Feedback Anonymization
Act One: Setup
What does it really mean:
• Application and company specific lingo
• Additional checks needed in some cases (human)
• E2E pipeline which can be reused for other purposes
28. Conclusion
• In more than one way, Clouds are teens too. They are:
• Sometimes immature
• Sometimes unpredictable
• Have a logic of their own (different from adults)
(But also with a fresh vision, and part of our future)
29. Conclusion
Our first job is to explain this to our customers and try to manage their expectations:
• How it is (not) going to solve all their problems
• How it differs from what they are used to
• Cost, Scope of Work, Way of Work, Maintenance, Performance …
And use POCs as much as possible, for every topic that is new to your customer.
Thank you all for coming today. My name is Miro Miljanić, I'm a Data Architect at CROZ, and I'm currently responsible for managing Cloud data initiatives.
At CROZ we have a long history of data and software engineering and consulting, but only in the last few years have we begun to gain significant experience in the Cloud.
Today's talk is about several of our experiences with data and AI related Cloud initiatives. Although it has Telco in its name, not all of the examples were built for telco companies, but they are legitimate use cases that could be applied in any telecommunications company.
In Act One, the Setup, the main protagonist is introduced, along with their everyday life and the inciting incident: the appearance of a love interest or a challenge.
In our case, this is the initial project or POC description and the drive behind it.
Act Two: Conflict contains the main character's pursuit of the goal and the complications that arise. At one point the main character seems to have reached the goal, but it is a false victory: there are deeper issues, or the victory is short-lived. This is the emotional low point, where the character learns an important lesson and re-evaluates their priorities.
In our case this is the turning point – the situation that arose and changed the course of action.
Act Three: Resolution is all about the happy ending. Our hero overcomes all obstacles, wins their love, becomes a better person, and they live happily ever after.
So, in the third part, I'll explain what we did to solve the problem.
Single view meant that there should be replication, including delta replication, from several databases (with different engines) into one Cloud DB, used not only for corporate reporting but also for other analysis. That would be the main purpose of this DB; the applications, data processing and ETL logic would remain on the on-prem databases.
Reimplementation meant rewriting reporting logic from legacy DB code, views, packages and procedures, and from the legacy reporting tool, into the new reporting tool.
And initial Governance meant that there should be cataloguing and lineage of data, together with a data access and security model. As with every change implementation, this was the right time to enforce it.
So, since this was a large scope of work, we began a thorough analysis of the requirements and of the multiple source systems, which contained more than 10k objects and their code. We also started working on the architecture and several POCs covering various scenarios:
Delta load – what we would have to implement on the source to propagate only changes to the target DB (a minimal sketch of one such option follows below)
DB code reimplementation scenarios – how to handle it, where to do it, and how to avoid affecting application logic
How to manage reporting logic reimplementation, reporting optimizations, data maintenance scenarios, and so on
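As a minimal illustration of one delta-load option, the sketch below assumes the source table carries a reliable modified_at column that can serve as a watermark; the table and column names and the DB-API cursor are assumptions for the example, not what was actually built (in the end no delta replication was implemented).

def extract_delta(cursor, table: str, last_watermark):
    """One watermark-based delta pass: rows changed since the previous run."""
    cursor.execute(f"SELECT MAX(modified_at) FROM {table}")
    (new_watermark,) = cursor.fetchone()
    if new_watermark is None or new_watermark <= last_watermark:
        return [], last_watermark          # nothing new since the last run
    cursor.execute(
        f"SELECT * FROM {table} "
        "WHERE modified_at > :lo AND modified_at <= :hi",
        {"lo": last_watermark, "hi": new_watermark},
    )
    return cursor.fetchall(), new_watermark

Real sources rarely make this easy (missing or unreliable timestamps, hard deletes), which is exactly why delta load needed its own POC scenario.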
Things were going pretty well: we had a good relationship with the customer and an understanding of the complexity of the task, and the analysis and architecture setup were on track, when we received the following response from the customer:
"That's all nice, but this is a bit too much for us now – what could we get for X days?"
And yes, X was not nearly the number we had anticipated for the whole solution.
Things were going pretty well, …
So, what did we do?
Anonymization provides a way to remove private data from the reviews without deleting them, and it allows us to keep the reviews in the database so that they can be used for other purposes in the future, e.g. topic extraction or sentiment analysis.
Example of the original message, Regular Anonymization and Synthetic replacement
The scope of this project covers a Named Entity Recognition (NER) machine learning problem.
Anonymization – a rule engine for rule-based anonymization: PII that can be recognized algorithmically by a specific format, e.g. an e-mail address or phone number.
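A deliberately naive Python sketch of that rule-engine idea: PII with a recognizable format is replaced by a placeholder token. The regular expressions below are simplified illustrations, not the production rule set.

import re

# Simplified rule engine: format-recognizable PII is swapped for a placeholder.
# The patterns are intentionally naive examples.
RULES = {
    "<e-mail>": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "<Phone Number>": re.compile(r"\b\d{9,12}\b"),
}

def rule_based_anonymize(text: str) -> str:
    for placeholder, pattern in RULES.items():
        text = pattern.sub(placeholder, text)
    return text

# rule_based_anonymize("Call me on 099887766 or mail mmiljanic@croz.net")
# -> "Call me on <Phone Number> or mail <e-mail>"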
PII detection outputs labels together with a confidence score for each label.
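The shape of those detections, and how the confidence score can route borderline cases to human review, might look roughly like this; the Detection fields and the 0.85 threshold are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Detection:
    start: int          # character offsets of the detected span
    end: int
    label: str          # e.g. "Person", "Location", "Phone Number"
    confidence: float   # model confidence for this label

def split_by_confidence(detections: list[Detection], threshold: float = 0.85):
    """Auto-accept confident detections; send the rest to human review."""
    auto = [d for d in detections if d.confidence >= threshold]
    review = [d for d in detections if d.confidence < threshold]
    return auto, review

The review list is what feeds the human-in-the-loop step described below.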
Things were going pretty well, …
AWS, Databricks problem
1. Human in the loop – Label Studio, an open-source application, integrated into our solution and deployed as an App Service on Azure
2. Human-in-the-loop enhancement – human-reviewed output is used as gold labels, i.e. new training data
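Finally, a hedged sketch of how human-reviewed spans could be folded back into the training set as gold labels; the (text, spans) representation is a generic illustration, not the Label Studio export schema.

# Generic sketch: human-corrected spans become gold training examples.
# The (start, end, label) tuple format is an assumption for illustration.
def to_training_example(text: str, reviewed_spans: list[tuple[int, int, str]]) -> dict:
    return {"text": text, "entities": reviewed_spans}

def build_gold_dataset(reviewed_items: list[tuple[str, list[tuple[int, int, str]]]]) -> list[dict]:
    return [to_training_example(text, spans) for text, spans in reviewed_items]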