Jochem van Grondelle, OLX Europe
Prosus Data Speaker Series, Feb 2021
To mesh or...
mess up your
data organisation
2
Jochem van Grondelle
Data Engineering Manager
@ OLX Group since 2015
3
OLX GROUP IS PART OF PROSUS
A collection of leading
companies and exciting
businesses!
4
PROSUS IS A $120B MARKET CAP. COMPANY
A global internet and entertainment
group and one of the largest
technology investors in the world.
▪ US$120bn market capitalisation
▪ US$44.5bn revenues over the last 3 years
▪ US$3.4bn trading profits over the last 3 years
▪ US$47M average invested in M&A per annum
▪ US$18.3bn FY18-19 revenues
▪ US$3.4bn FY18-19 trading profit
▪ Present in 13 of the top 20 fastest-growing economies*
*IMF World Economic Outlook, based on 2019E GDP growth estimates for the countries with over 50 million population
5
OLX GROUP TODAY:
THE WORLD'S #1 CLASSIFIEDS BUSINESS
[Figure: OLX Group brands by category: horizontals, real estate verticals, car verticals, other verticals (furniture, heavy machinery, services, jobs) and convenient transactions, in markets including Turkey, Russia, UAE, Africa, the Philippines, Portugal, Poland, Romania, Egypt, South Africa, India, Latin America and Asia.]
6
WHO WE ARE - OLX GROUP
We are a global product and tech group.
★ 20+ brands
★ 15 time zones
★ 10,000+ people
★ One mindset
We are a team of 10,000+ ambitious,
curious people building market-leading
trading platforms that empower 300
million people every month to upgrade
their lives.
7
Agenda
1. Challenges in data organizations today
2. What is data mesh?
3. Our data journey
4. Next steps and challenges
8
Concepts in this presentation are based on the data mesh
architecture abstracted and promoted by Zhamak Dehghani
Challenges in data
organizations today
10
Complexity
The biggest challenge with big data is…
11
There has been a revolution in how operational applications
are being run
▪ Over the last 20 years there has been a continuous trend to move away from the
monolith towards distributed, domain-driven architectures
12
However, data engineers often stay behind by ingesting all that
data in one central data lake - the biggest monolith of them all
▪ The original data warehousing approach was getting data from all different
complex domains and putting them in one big fat database.
▪ Due to issues with scale in volume and complexity, architectures evolved into a
data lake architecture: Don't worry about that whole modeling we talked about,
just get the data out of the operational systems, bring them to this big, fat data
lake in its original form.
https://martinfowler.com/articles/data-monolith-to-mesh.html
13
The data team responsible for storing big data is mostly
disconnected from consumers trying to make sense of that data.
Source: Zhamak Dehghani - Data Mesh Principles and Logical Architecture
14
Data teams are trying to break down their architecture by
functional areas
Source: Zhamak Dehghani - Data Mesh Principles and Logical Architecture
15
However, these data engineers are still siloed in between the
world of operational systems and the world of consumers
Source: Zhamak Dehghani - Data Mesh Principles and Logical Architecture
16
Have pity on your data engineers
▪ They often depend on teams
that have no incentive to provide
meaningful, truthful and correct data
▪ They have little understanding of the
source domains that generate the data
and lack the domain expertise in their
teams
▪ They need to provide data for a diverse
set of needs without access to all
the consuming domains' experts.
17
In summary: We need to revolutionize our data strategy.
How can we apply domain-driven architecture to data?
▪ The cycle of innovation requires constant
adaptation to the data.
▪ This centralized system simply doesn't
scale.
▪ It has divided the work based on the
technical operation, implemented by one
or more silos of data engineers.
What is data mesh?
19
Data mesh sets a foundation for getting value from analytical
data at scale – using 4 principles.
Domain-oriented decentralized data
ownership and architecture
Data as a product
Self-service data infrastructure
as a platform
Federated governance
20
Principle 1: Domain oriented decentralized data ownership
▪ Although DDD has influenced modern
architectural thinking, the notion of business
domains has been disregarded in data.
▪ All raw data is in the lake, but there is often no
clear separation of business domains.
▪ Rather than limiting to ingesting raw data from
domains into a centrally owned data lake,
domains need to own and serve their
domain datasets in an easily consumable
way.
21
Domain-driven design moves from a ‘big ball of mud’…
“A BIG BALL OF MUD is haphazardly
structured, sprawling, sloppy, duct-tape and
bailing wire, spaghetti code jungle. We’ve all
seen them. These systems show
unmistakable signs of unregulated growth,
and repeated, expedient repair. Information is
shared promiscuously among distant
elements of the system, often to the point
where nearly all the important information
becomes global or duplicated. The overall
structure of the system may never have been
well defined. If it was, it may have eroded
beyond recognition.”
22
…to contextual models
Great article about DDD: https://medium.com/raa-labs/part-1-domain-driven-design-like-a-pro-f9e78d081f10
▪ Focused
▪ Small
▪ Decoupled
▪ Easy to change
▪ Enables autonomy
▪ Ubiquitous language
23
Some domains are more source oriented while other domains
are more consumer oriented.
Domains aligned with the source Domains aligned with consumption
Chat messages
Browsing interactions
Deliveries
Item recommendations
Customer support tickets
User segmentation
Fraud detection
24
Principle 2: Data as a product
▪ Domain data teams must consider their data assets as their products
and the rest of the organization as their customers.
– Discoverable
– Addressable
– Trustworthy and truthful
– Self-describing semantics and syntax
– Inter-operable and governed by
global standards
Design
“Build what
matters”
Marketing
“Tell people
about it”
Engineering
“Ship it!”
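The product qualities listed on this slide can be made concrete with a small descriptor. The following is only a minimal sketch under assumptions: the `DataProduct` class, its fields, the `missing_qualities` helper and the example address are all hypothetical and are not taken from the talk or from any OLX tooling.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataProduct:
    """Hypothetical descriptor for a domain-owned data product."""
    name: str                                   # discoverable: the catalog entry
    address: str                                # addressable: a stable, unique location
    owner: str                                  # domain team accountable for the data
    description: str = ""                       # self-describing semantics
    schema: Dict[str, str] = field(default_factory=dict)  # column -> type (syntax)
    freshness_sla_hours: Optional[int] = None   # trustworthy: a published service level


def missing_qualities(product: DataProduct) -> List[str]:
    """Return the data-as-a-product qualities this descriptor does not yet meet."""
    gaps = []
    if not product.name:
        gaps.append("discoverable")
    if not product.address:
        gaps.append("addressable")
    if product.freshness_sla_hours is None:
        gaps.append("trustworthy")
    if not product.description or not product.schema:
        gaps.append("self-describing")
    return gaps


# Made-up example: the location and team name are placeholders.
chat_messages = DataProduct(
    name="chat_messages",
    address="s3://example-bucket/chat/messages/v1/",
    owner="chat-team",
    schema={"message_id": "string", "sent_at": "timestamp"},
)
print(missing_qualities(chat_messages))  # ['trustworthy', 'self-describing']
```

A check like this could run when a team registers a dataset, so that gaps surface before consumers depend on it.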
25
Establish the responsibility of domain data product owner –
which could simply be an additional hat to any type of engineer
▪ Makes decisions around the vision and the roadmap for the data products
▪ Concerns themselves with satisfaction of their consumers
▪ Continuously measures and improves the quality of the data
▪ Responsible for the lifecycle of the domain datasets
▪ Defines success criteria and business-aligned metrics
26
Principle 3: Self-service data infra as a platform
▪ High-level abstraction of infrastructure enables teams to autonomously own
their data products
▪ Must include tooling that supports a developer’s workflow of creating,
maintaining and running data products with less specialized knowledge
than existing technologies assume
▪ Domain agnostic
▪ Hides underlying complexity and is designed in a self-service manner
▪ But: Treat domain data ownership as primary concern, and tooling and
pipelines secondary
27
The self-service data infra platform can include many generic
elements aimed at making domain data producers more efficient
– Data product versioning
– Data product schema
– Unified data access control and logging
– Data pipeline implementation and orchestration
– Data product discovery, catalog registration and
publishing
– Data governance and standardization
– Data product lineage
– Data product monitoring/alerting/log
– Data product quality metrics (collection and
sharing)
28
Principle 4: (Computational) federated governance
▪ Independent data products need to
interoperate through global standardization
▪ Naming conventions, identifiers, nulls
▪ It is an art to find a balance between what shall
be standardized globally, and what shall be left
to the domains to decide.
– For example, the semantics of ‘chat replies’
could be left to the chat team
– However, a ‘buyer’, as a population of
‘users’, is a global concern.
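"Computational" governance suggests that global standards are checked automatically rather than by review. A minimal sketch of such a check follows, assuming two hypothetical global rules: column names are snake_case, and the shared identifier `user_id` is always a string. The rule set and the `governance_violations` helper are illustrative assumptions, and null handling is left out.

```python
import re
from typing import Dict, List

# Hypothetical global standards; each organization would agree on its own.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
GLOBAL_IDENTIFIERS = {"user_id": "string"}  # shared identifier -> required type


def governance_violations(schema: Dict[str, str]) -> List[str]:
    """Check one data product schema against the global standards."""
    violations = []
    for column, column_type in schema.items():
        if not SNAKE_CASE.match(column):
            violations.append(f"{column}: name is not snake_case")
        expected = GLOBAL_IDENTIFIERS.get(column)
        if expected is not None and column_type != expected:
            violations.append(f"{column}: global identifier must be {expected}")
    return violations


# Made-up schema for a 'chat replies' data product.
chat_replies = {"user_id": "int", "replyText": "string", "replied_at": "timestamp"}
for violation in governance_violations(chat_replies):
    print(violation)
```

Everything the check does not cover, such as what counts as a chat reply, stays with the owning domain.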
29
In summary: The “great divide”.
When it still looked simple.
Source: Zhamak Dehghani - Data Mesh Principles and Logical Architecture
30
In summary: The “greater divide”.
When it was still manageable.
Source: Zhamak Dehghani - Data Mesh Principles and Logical Architecture
31
In summary: The “best divide”.
When we thought we could still handle it.
Source: Zhamak Dehghani - Data Mesh Principles and Logical Architecture
32
In summary: Data mesh
The paradigm shift
Source: Data mesh 101 – Everything you need to know to get started
Our data journey
34
A mature data infrastructure is in place and managed globally for
OLX Group to serve, ingest and consume data
Catalog
Self Service tool for data
management, consumption and
discovery
Data Engineers
Ninja
Set of libraries that unify the
tracking integration.
Hydra
A platform-agnostic raw data
HTTP collector.
Lazarus
Service that synchronises the data
from databases.
Data Lake
Data storage compliant with
data protection laws.
Reservoir
A dedicated and reduced data
storage for a specific purpose
Cerberus
A service to process real-time
events
Schema Management
Easily query via Athena,
Spectrum and Presto.
Laquesis
Self Service tool for performing
Experiments, Feature Flags and
Surveys.
Odyn
Operational Data Hub consisting
of a scheduler, operators and
storage
KaaS
Packaged solution for Apache
Kylin usage
Real time process
Machine learning
Analysts
Analytics
Product Managers
Databases
User devices
Microservices
COLLECTION GOVERNANCE CONSUMPTION
DAPI
Generic and scalable DATA API
Real time
5 minutes
5 minutes - 1 hour
SERVICES
Data Scientist
35
Scheduler
ETL-as-a-code platform with
advanced dependency
management system.
Odyn decides which tasks
should run, when and where to
achieve maximum efficiency.
It supports templating, step
expansion and notifications.
Operators
Odyn makes it easy to transform
and query the data using SQL.
Out of the box it supports
Athena, Presto, Hive, Redshift,
AWS Batch, Kylin and Spark.
It can be easily extended with
custom operators written in
Python.
Storage
Odyn provides cheap and reliable
storage as part of the Data
Reservoir.
It contains a built-in solution for
complying with data
protection laws.
It is fully integrated with other
data services.
Operational Data Hub
One place for data access so that
many point-to-point connections
between callers and data suppliers
do not need to be made.
Odyn allows blazing fast data
processing as well as
collaboration and sharing of
datasets between the users.
Odyn is the operational data hub consisting of a scheduler,
extendable set of operators and storage
180 users
200 active DAGs
11 K daily tasks
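The scheduler's dependency management boils down to running each task only after the tasks it depends on. The sketch below illustrates that idea with Python's standard `graphlib` module. It is not Odyn's implementation, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "load_raw_events": set(),
    "clean_events": {"load_raw_events"},
    "build_user_segments": {"clean_events"},
    "publish_report": {"build_user_segments"},
}

# static_order() yields every task after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
# ['load_raw_events', 'clean_events', 'build_user_segments', 'publish_report']
```

A production scheduler would also decide when and where each task runs; this sketch only covers the ordering.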
36
OLX Europe’s data team leverages the global data infra and
provides additional data products adapted to regional needs
EU Data
Build and maintain a best-in-class data analytics platform
that enables easy, timely, fast data discovery and consumption
Enable top-notch data mastery in our company by providing the right training
and coaching to our team and our users
Assure top quality of prepared data for the right purpose at the right time
following product and business strategy
Solve new problems in innovative and pragmatic ways grabbing
opportunities for quick value delivery
More data played back to our
external customers
Data-driven product
development lifecycle
Business decisions
fueled by data
37
For example, Sherlock as part of the analytical data platform
enables our users to discover, understand, and explore datasets
38
Our internal data academy helps anyone in OLX get more
familiar with data concepts, technologies and our platforms
39
Finally, Yamato is a large AWS Redshift database empowering
many data teams to process data blazingly fast in an easy way
250 users
150 K queries daily
16 × dc2.8xlarge nodes, 40 TB
40
Some of our consumer-oriented data teams are already
checking all the boxes for data mesh
Customer support Sales CRM integration
Recommendations Customer classification
41
Challenges in data teams
▪ Data teams are often a bottleneck and cannot keep up
with product development fast enough
▪ Data teams are the go-to point for expertise about domain
data, however they are not the specialists
▪ Lack of governance across data teams
▪ Duplication of work and/or reinventing the wheel
▪ Lack of software engineering principles in data processing
42
Challenges in operational teams in OLX
▪ Technical designs for new features do not include requirements for analytics,
experimentation and machine learning projects, so
data needs are not always covered from the start
▪ Data ownership mostly limited to what is required to run a feature
▪ Operational teams are not always aware of the value of data
43
So, although the doors are now open across the data
organization, there is still a divide between data and tech.
Source: Zhamak Dehghani - Data Mesh Principles and Logical Architecture
45
However, many things are going well - and we are getting ready
for the paradigm shift!
▪ Self-service data infrastructure is mostly in place already
▪ Data publication possible for anyone
▪ GDPR, security and access governance are in place.
▪ There is a growing acceptance for domain data ownership in operational teams
▪ We already have 100s of product managers so no lack of product thinking
Next steps and
challenges
47
Let’s remind ourselves of the 4 principles of data mesh – and
first assess what is in place.
Domain-oriented decentralized data
ownership and architecture
Data as a product
Self-service data infra
as a platform
Federated governance
48
Start small: Pilot partial embedding of domain data engineers in
a few selected operational teams on project basis
▪ Data engineers will ensure that operational teams integrate data
requirements into the design of new features
▪ Data engineers will set the foundation for domain data sets after which the
ownership remains in these operational teams
▪ Data engineers will team up with product managers
▪ Data engineers will learn from software engineers
and adopt software engineering best practices
▪ Data engineers will facilitate trainings
▪ Ensure engineering leaders are on board!
49
Make both sides aware that this is a win-win situation!
▪ Data engineers often lack software engineering
standard practices when it comes to building data
assets.
▪ Software engineers who are building operational
systems often have no experience utilizing data
engineering tool sets, or even understanding the
concept of ‘datasets’.
▪ Removing the skill set silos will lead to
creation of a larger and deeper pool of data
engineering skill sets available to the
organization!
50
Meanwhile, start mapping out the major domains, identify
ownership and develop a data maturity framework by domain
Source: Zhamak Dehghani - Data Mesh Principles and Logical Architecture
51
Define how to measure success and set a baseline
Data quality
Domain-data Data consumer
needs covered
Documented and
useful datasets
Data
discoverability
Usage Satisfaction
Speed/reliability Skill levels Ease of use
Risk &
Governance
Cost &
Compliance
Ubiquitous
language
Data as a product
Self-service data infra
Federated governance
52
It sounds too good to be true. What is the fine print?
▪ Data mesh is primarily about mindset and
organization; technology is second
▪ Success depends on converging operational and
data roles – organization needs to be ready
▪ Organization needs to be big enough to benefit
▪ Data mesh is a vision that needs to be tailored to
your organization – no plug and play solution
▪ The data lake can still exist in this architecture, but
it becomes just another node in the mesh, rather
than the centerpiece.
53
You are not alone! Other companies are taking steps towards a
data mesh architecture – and a learning community is live
▪ https://launchpass.com/data-mesh-learning
54
Jochem van Grondelle
Data Engineering Manager
linkedin.com/in/jochemvangrondelle
jochem.vangrondelle@olx.com
Thank you! Feel free to reach out to discuss further
55

To mesh or mess up your data organisation - Jochem van Grondelle (Prosus/OLX Group)

  • 1. Jochem van Grondelle, OLX Europe Prosus Data Speaker Series, Feb 2021 To mesh or... mess up your data organisation
  • 2. 2 Jochem van Grondelle Data Engineering Manager @ OLX Group since 2015
  • 3. 3 3 OLX GROUP IS PART OF PROSUS A collection of leading companies and exciting businesses!
  • 4. 4 4 PROSUS IS A $120B MARKET CAP. COMPANY A global internet and entertainment group and one of the largest technology investors in the world. US$120bn Market capitalisation US$44.5bn Revenues over the last 3 years US$3.4bn Trading profits over the last 3 years US$47M Average invested in M&A per annum US$18.3bn FY18-19 Revenues US$3.4bn FY18-19 Trading profit 13 of top 20 Fastest-growing economies* Present in *IMF World Economic Outlook, based on 2019E GDP growth estimates for the countries with over 50 million population
  • 5. 5 5 OLX GROUP TODAY: THE WORLD'S #1 CLASSIFIEDS BUSINESS HORIZONTALS REAL ESTATE VERTICALS OTHER VERTICALS CAR VERTICALS global Turkey Russia UAE Africa and Philippines Russia Portugal Poland Romania, Egypt Furniture, Europe Heavy machinery, global Services, Poland Poland South Africa Romania Portugal CONVENIENT TRANSACTIONS LATAM, Asia, Poland UAE Latin America South Africa Jobs, India Jobs, Poland +
  • 6. 6 6 WHO WE ARE - OLX GROUP We are a global product and tech group. ★ +20 brands ★ 15 time zones ★ +10,000 people ★ One mindset We are a team of 10,000+ ambitious, curious people building market-leading trading platforms that empower 300 million people every month to upgrade their lives.
  • 7. 7 Agenda 4. Next steps and challenges 3. Our data journey 2. What is data mesh? 1. Challenges in data organizations today
  • 8. 8 Concepts in this presentation are based on the data mesh architecture abstracted and promoted by Zhamak Dehghani
  • 10. 10 The biggest challenge with big data is… Complexity
  • 11. 11 There has been a revolution in how operational applications are being run ▪ Over the last 20 years there has been a continuous trend to move away from the monolith towards distributed, domain-driven architectures
  • 12. 12 However, data engineers often stay behind by ingesting all that data in one central data lake - the biggest monolith of them all ▪ The original data warehousing approach was getting data from all different complex domains and putting them in one big fat database. ▪ Due to issues with scale in volume and complexity, architectures evolved into a data lake architecture: Don't worry about that whole modeling we talked about, just get the data out of the operational systems, bring them to this big, fat data lake in its original form. https://martinfowler.com/articles/data-monolith-to-mesh.html
  • 13. 13 The data team responsible for storing big data is mostly disconnected from consumers trying to make sense of that data. Source: Zhamak Dehghani - Data Mesh Principles and Logical Architecture
  • 14. 14 Data teams are trying to break down their architecture by functional areas Source: Zhamak Dehghani - Data Mesh Principles and Logical Architecture
  • 15. 15 However, these data engineers are still siloed in between the world of operational systems and the world of consumers Source: Zhamak Dehghani - Data Mesh Principles and Logical Architecture
  • 16. 16 Have pity on your data engineers ▪ They often depend on teams that have no incentive to provide meaningful, truthful and correct data ▪ They have little understanding of the source domains that generate the data and lack the domain expertise in their teams ▪ They need to provide data for a diverse set of needs without access to all the consuming domains' experts.
  • 17. 17 In summary: We need to revolutionize our data strategy. How can we apply domain-driven architecture to data? ▪ The cycle of innovation requires constant adaptation to the data. ▪ This centralized system simply doesn't scale. ▪ It has divided the work based on the technical operation, implemented by one or more silos of data engineers.
  • 18. What is data mesh?
  • 19. 19 Data mesh sets a foundation for getting value from analytical data at scale – using 4 principles: (1) Domain-oriented decentralized data ownership and architecture; (2) Data as a product; (3) Self-service data infrastructure as a platform; (4) Federated governance
  • 20. 20 Principle 1: Domain-oriented decentralized data ownership ▪ Although DDD has influenced modern architectural thinking, the notion of business domains has been disregarded in data. ▪ All raw data is in the lake, but there is often no clear separation of business domains. ▪ Rather than only ingesting raw data from domains into a centrally owned data lake, domains need to own and serve their domain datasets in an easily consumable way.
  • 21. 21 Domain-driven design moves from a ‘big ball of mud’… “A BIG BALL OF MUD is haphazardly structured, sprawling, sloppy, duct-tape and bailing wire, spaghetti code jungle. We’ve all seen them. These systems show unmistakable signs of unregulated growth, and repeated, expedient repair. Information is shared promiscuously among distant elements of the system, often to the point where nearly all the important information becomes global or duplicated. The overall structure of the system may never have been well defined. If it was, it may have eroded beyond recognition.”
  • 22. 22 …to contextual models Great article about DDD: https://medium.com/raa-labs/part-1-domain-driven-design-like-a-pro-f9e78d081f10 ▪ Focused ▪ Small ▪ Decoupled ▪ Easy to change ▪ Enables autonomy ▪ Ubiquitous language
  • 23. 23 Some domains are more source oriented while other domains are more consumer oriented. Domains aligned with the source Domains aligned with consumption Chat messages Browsing interactions Deliveries Item recommendations Customer support tickets User segmentation Fraud detection
  • 24. 24 Principle 2: Data as a product ▪ Domain data teams must consider their data assets as their products and the rest of the organization as their customers. – Discoverable – Addressable – Trustworthy and truthful – Self-describing semantics and syntax – Interoperable and governed by global standards Design “Build what matters” Marketing “Tell people about it” Engineering “Ship it!”
  • 25. 25 Establish the responsibility of domain data product owner – which could simply be an additional hat to any type of engineer ▪ Makes decisions around the vision and the roadmap for the data products ▪ Concerns themselves with satisfaction of their consumers ▪ Continuously measures and improves the quality of the data ▪ Responsible for the lifecycle of the domain datasets ▪ Defines success criteria and business-aligned metrics
  • 26. 26 Principle 3: Self-service data infra as a platform ▪ High-level abstraction of infrastructure enables teams to autonomously own their data products ▪ Must include tooling that supports a developer’s workflow of creating, maintaining and running data products with less specialized knowledge than existing technologies assume ▪ Domain agnostic ▪ Hides underlying complexity and is designed in a self-service manner ▪ But: Treat domain data ownership as the primary concern, and tooling and pipelines as secondary
  • 27. 27 The self-service data infra platform can include many generic elements aimed at making domain data producers more efficient – Data product versioning – Data product schema – Unified data access control and logging – Data pipeline implementation and orchestration – Data product discovery, catalog registration and publishing – Data governance and standardization – Data product lineage – Data product monitoring/alerting/log – Data product quality metrics (collection and sharing)
  • 28. 28 Principle 4: (Computational) federated governance ▪ Independent data products need to interoperate through global standardization ▪ Naming conventions, identifiers, nulls ▪ It is an art to find a balance between what shall be standardized globally, and what shall be left to the domains to decide. – For example, the semantics of ‘chat replies’ could be left to the chat team – However, a ‘buyer’, as a population of ‘users’, is a global concern.
  • 29. 29 In summary: The ”great divide”. When it still looked simple. Source: Zhamak Dehghani - Data Mesh Principles and Logical Architecture
  • 30. 30 In summary: The ”greater divide”. When it was still manageable. Source: Zhamak Dehghani - Data Mesh Principles and Logical Architecture
  • 31. 31 In summary: The ”best divide”. When we thought we could still handle it. Source: Zhamak Dehghani - Data Mesh Principles and Logical Architecture
  • 32. 32 In summary: Data mesh The paradigm shift Source: Data mesh 101 – Everything you need to know to get started
  • 34. 34 A mature data infrastructure is in place and managed globally for OLX Group to serve, ingest and consume data. Components: Catalog (self-service tool for data management, consumption and discovery); Ninja (set of libraries that unify the tracking integration); Hydra (a platform-agnostic raw data HTTP collector); Lazarus (service that synchronises the data from databases); Data Lake (data storage compliant with data protection laws); Reservoir (a dedicated and reduced data storage for a specific purpose); Cerberus (a service to process real-time events); Schema Management (easily query via Athena, Spectrum and Presto); Laquesis (self-service tool for performing experiments, feature flags and surveys); Odyn (operational data hub consisting of a scheduler, operators and storage); KaaS (packaged solution for Apache Kylin usage); DAPI (generic and scalable data API). Data sources: databases, user devices, microservices. Layers: collection, governance, consumption, services. Latency: real time; 5 minutes; 5 minutes - 1 hour. Consumption: real-time processing, machine learning, analytics. Consumers: data engineers, analysts, product managers, data scientists.
  • 35. 35 Odyn is the operational data hub consisting of a scheduler, an extendable set of operators and storage. Scheduler: ETL-as-code platform with an advanced dependency management system. Odyn decides which tasks should run, when and where, to achieve maximum efficiency. It supports templating, step expansion and notifications. Operators: Odyn makes it easy to transform and query the data using SQL. Out of the box it supports Athena, Presto, Hive, Redshift, AWS Batch, Kylin and Spark. It can be easily extended with custom operators written in Python. Storage: Odyn is cheap and reliable storage that forms part of the Data Reservoir. It contains a built-in solution to be compliant with all the data protection laws. It is fully integrated with other data services. Operational Data Hub: One place for data access, so that many point-to-point connections between callers and data suppliers do not need to be made. Odyn allows blazing fast data processing as well as collaboration and sharing of datasets between the users. Usage: 180 users, 200 active DAGs, 11K daily tasks.
  • 36. 36 OLX Europe’s data team leverages the global data infra and provides additional data products adapted to regional needs. EU Data: (1) Build and maintain a best-in-class data analytics platform that enables easy, timely, fast data discovery and consumption; (2) Enable top-notch data mastery in our company by providing the right training and coaching to our team and our users; (3) Assure top quality of prepared data for the right purpose at the right time following product and business strategy; (4) Solve new problems in innovative and pragmatic ways grabbing opportunities for quick value delivery. Outcomes: more data played back to our external customers; data-driven product development lifecycle; business decisions fueled by data.
  • 37. 37 For example, Sherlock as part of the analytical data platform enables our users to discover, understand, and explore datasets
  • 38. 38 Our internal data academy helps anyone in OLX get more familiar with data concepts, technologies and our platforms
  • 39. 39 Finally, Yamato is a large AWS Redshift database that lets many data teams process data blazingly fast in an easy way. 250 users; 150K queries daily; 16xDC2.8XL - 40 TB
  • 40. 40 Some of our consumer-oriented data teams are already checking all the boxes for data mesh Customer support Sales CRM integration Recommendations Customer classification
  • 41. 41 Challenges in data teams ▪ Data teams are often a bottleneck and cannot keep up with product development fast enough ▪ Data teams are the go-to point for expertise about domain data, however they are not the specialists ▪ Lack of governance across data teams ▪ Duplication of work and/or reinventing the wheel ▪ Lack of software engineering principles in data processing
  • 42. 42 Challenges in operational teams in OLX ▪ Technical design for new features does not include requirements for analytics, experimentation and machine learning projects – so data needs are not always covered from the start ▪ Data ownership is mostly limited to what is required to run a feature ▪ Operational teams are not always aware of the value of data
  • 43. 43 So, although the doors are now open across the data organization, there is still a divide between data and tech. Source: Zhamak Dehghani - Data Mesh Principles and Logical Architecture
  • 44. 44 So, although the doors are now open across the data organization, there is still a divide between data and tech. Source: Zhamak Dehghani - Data Mesh Principles and Logical Architecture
  • 45. 45 However, many things are going well - and we are getting ready for the paradigm shift! ▪ Self-service data infrastructure is mostly in place already ▪ Data publication is possible for anyone ▪ GDPR, security and access governance are in place. ▪ There is a growing acceptance of domain data ownership in operational teams ▪ We already have 100s of product managers, so there is no lack of product thinking
  • 47. 47 Let’s remind ourselves of the 4 principles of data mesh – and first assess what is in place: (1) Domain-oriented decentralized data ownership and architecture; (2) Data as a product; (3) Self-service data infra as a platform; (4) Federated governance
  • 48. 48 Start small: Pilot partial embedding of domain data engineers in a few selected operational teams on a project basis ▪ Data engineers will ensure that operational teams integrate data requirements into the design of new features ▪ Data engineers will set the foundation for domain data sets, after which the ownership remains in these operational teams ▪ Data engineers will team up with product managers ▪ Data engineers will learn from software engineers and adopt software engineering best practices ▪ Data engineers will facilitate trainings ▪ Ensure engineering leaders are on board!
  • 49. 49 Make both sides aware that this is a win-win situation! ▪ Data engineers often lack software engineering standard practices when it comes to building data assets. ▪ Software engineers who are building operational systems often have no experience utilizing data engineering tool sets, or even understanding the concept of ‘datasets’. ▪ Removing the skill set silos will lead to creation of a larger and deeper pool of data engineering skill sets available to the organization!
  • 50. 50 Meanwhile, start mapping out the major domains, identify ownership and develop a data maturity framework by domain Source: Zhamak Dehghani - Data Mesh Principles and Logical Architecture
  • 51. 51 Define how to measure success and set a baseline Data quality Domain-data Data consumer needs covered Documented and useful datasets Data discoverability Usage Satisfaction Speed/reliability Skill levels Ease of use Risk & Governance Cost & Compliance Ubiquitous language Data as a product Self-service data infra Federated governance
  • 52. 52 It sounds too good to be true. What is the fine print? ▪ Data mesh is primarily about mindset and organization; technology is second ▪ Success depends on converging operational and data roles – the organization needs to be ready ▪ The organization needs to be big enough to benefit ▪ Data mesh is a vision that needs to be tailored to your organization – there is no plug-and-play solution ▪ The data lake can still exist in this architecture, but it becomes just another node in the mesh, rather than the centerpiece.
  • 53. 53 You are not alone! Other companies are taking steps towards a data mesh architecture – and a learning community is live
  • 54. 54 Jochem van Grondelle Data Engineering Manager linkedin.com/in/jochemvangrondelle jochem.vangrondelle@olx.com Thank you! Feel free to reach out to discuss further
  • 55. 55
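
As an illustration of how Principle 2 (data as a product) and Principle 4 (federated governance) could meet in practice, the sketch below describes a data product with a small descriptor and checks it against a global standard. It is a minimal example under stated assumptions, not something shown in the slides: the `DataProduct` class, its fields, the snake_case naming rule, and the `domain/name/vVERSION` address format are all hypothetical and do not describe OLX tooling.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical global standard: names are lowercase snake_case.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


@dataclass
class DataProduct:
    """Hypothetical descriptor a domain team publishes with its dataset."""
    name: str          # discoverable: listed in a catalog under this name
    domain: str        # the owning business domain, e.g. "chat"
    owner: str         # the domain data product owner
    description: str   # self-describing semantics
    version: str       # data product versioning
    schema: Dict[str, str] = field(default_factory=dict)  # column -> type


def address(product: DataProduct) -> str:
    """Build a stable address (assumed format) so the product is addressable."""
    return f"{product.domain}/{product.name}/v{product.version}"


def governance_violations(product: DataProduct) -> List[str]:
    """Return every breach of the (assumed) global conventions."""
    problems = []
    if not SNAKE_CASE.match(product.name):
        problems.append(f"name '{product.name}' is not snake_case")
    if not product.owner:
        problems.append("missing owner")
    if not product.description:
        problems.append("missing description")
    for column in product.schema:
        if not SNAKE_CASE.match(column):
            problems.append(f"column '{column}' is not snake_case")
    return problems


if __name__ == "__main__":
    chat_replies = DataProduct(
        name="chat_replies",
        domain="chat",
        owner="chat-team",
        description="One row per reply sent in a chat conversation.",
        version="1",
        schema={"reply_id": "string", "buyer_id": "string"},
    )
    print(address(chat_replies))
    print(governance_violations(chat_replies))
```

The split mirrors the balance described under Principle 4: the meaning of `chat_replies` stays with the chat domain, while naming rules and shared identifiers such as `buyer_id` are checked globally and automatically.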