Data Contracts: Consensus as Code - Pycon 2023

Ryan Collingwood
Business Analyst | Requirements Wrangler | Boundary Spanner | Continuously Learning at Containerchain
Data Contracts
Consensus as Code
Ryan Collingwood
2023-08-18
Who am I and my current context
• Ryan Collingwood, Head of Data & Analytics at Oroton
• Australia’s oldest luxury fashion company
• Centralised Data Team
• Monoliths (ERP & POS) surrounded by a number of SaaS products
• Data is mostly moved in batch
Why I think you might care about this
Responsibility in the modern data stack
Andrew Jones - Driving Data Quality with Data Contracts (2023)
Shout out to Andrew Jones
https://data-contracts.com/
Similar, Related, and Complementary Concepts
• APIs
• Data Dictionaries
• Data Mesh
• Event Storming
• Data Catalogs
• Domain Driven Design
I’d be curious to know what else you might add to this list
“Advice is a form of nostalgia. Dispensing it is a way of fishing the past from the disposal, wiping it off, painting over the ugly parts and recycling it for more than it's worth”
Mary Schmich
https://www.chicagotribune.com/columns/chi-schmich-sunscreen-column-column.htm
“If I could offer you only one tip for the future, sunscreen would be it.”
What are Data Contracts?
... outlines how data can get exchanged between two parties. It defines the structure, format, and rules of exchange in a distributed data architecture. These formal agreements make sure that there aren’t any uncertainties or undocumented assumptions about data.
https://atlan.com/data-contracts/
... is an agreed interface between the generators of data and its consumers. It sets the expectations around that data, defines how it should be governed, and facilitates the explicit generation of quality data that meets the business requirements.
Andrew Jones - Driving Data Quality with Data Contracts (2023)
Data Producers and Data Consumers
(Diagram labels: Team A, Team B, Team C)
You can be a Data Producer without knowing about it
(Diagram labels: Non-consensual API, Team C)
Broken pipelines, broken non-promises
(Diagram labels: Non-consensual API ×3, Team A, Team B, Team C, ❌)
One of the largest impediments to addressing data quality at any organization is the lack of collaboration between data producers and data consumers.
...
A common workaround (is the) proliferation of non-consensual APIs. Can’t get a software engineer to emit the data you need to solve some business problem? Connect your ELT tool to a production source and extract a batch dump on a schedule. Easy (Until things start breaking…whoops).
Chad Sanderson - https://dataproducts.substack.com/p/the-production-grade-data-pipeline
What makes up a Data Contract
https://github.com/PacktPublishing/Driving-Data-Quality-with-Data-Contracts/blob/main/Chapter03/order_events.yaml
However, data contracts are more than just a schema... we need our data contracts to capture metadata that describes how the data can be used, how it is governed, and the controls around the data
Driving Data Quality with Data Contracts - Andrew Jones (2023)
What makes up a Data Contract
• Schema
• Contract Governance
• Semantics
• Service Level Objectives
• Dataset Governance
• Mechanisms of Transmission
• People
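One way to picture these components is as a single typed record. The following is a minimal sketch only, assuming standard-library dataclasses; every class name, field name, and value is hypothetical and is not taken from the book's order_events.yaml or from the speaker's own meta schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Field:
    """One attribute: its schema type plus its human meaning."""
    name: str
    dtype: str          # schema: how systems store it
    description: str    # semantics: how people should read it


@dataclass(frozen=True)
class ServiceLevelObjectives:
    freshness_hours: int      # hypothetical SLO: maximum age of the data
    completeness_pct: float   # hypothetical SLO: minimum share of complete rows


@dataclass(frozen=True)
class DataContract:
    dataset: str
    version: str
    owner: str                   # people: who to talk to
    consumers: tuple[str, ...]   # people: who has registered as a consumer
    transmission: str            # mechanism of transmission, e.g. "batch"
    schema: tuple[Field, ...]
    slos: ServiceLevelObjectives
    governance: dict[str, str] = field(default_factory=dict)


# Hypothetical example contract, for illustration only.
orders = DataContract(
    dataset="orders",
    version="1.0.0",
    owner="team-a",
    consumers=("team-b", "team-c"),
    transmission="batch",
    schema=(
        Field("order_id", "string", "Unique identifier for an order"),
        Field("order_total", "decimal", "Order value including tax, in AUD"),
    ),
    slos=ServiceLevelObjectives(freshness_hours=24, completeness_pct=99.0),
    governance={"order_id": "internal", "order_total": "confidential"},
)

print([f.name for f in orders.schema])
```

A library such as Pydantic could play the same role with validation built in; dataclasses are used here only to keep the sketch dependency-free.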
Schema versus Semantics
Schema
• Systems interoperability
• Support for Implicit Validation by Database Technologies
• Ensuring we capture and retrieve the data consistently
Semantics
• Human Expectations
• Tends to require Explicit Validation by complementary solutions
• Ensuring we interpret the data consistently
Dates / times, monetary values - are a trap if considered only as schema.
What are your “schema” but “secretly semantic” situations?
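To make the dates and money trap concrete, here is a small sketch. It assumes a hypothetical order record with made-up field names (placed_at, total, currency). Both records would pass a type-only schema check, yet only one of them can be interpreted without guessing.

```python
from datetime import datetime, timezone
from decimal import Decimal


def semantic_problems(record: dict) -> list[str]:
    """Return problems that a type-only schema check would not catch."""
    problems = []
    placed_at = record["placed_at"]
    # A naive datetime is a valid datetime, but the instant it names is ambiguous.
    if placed_at.tzinfo is None or placed_at.utcoffset() is None:
        problems.append("placed_at has no timezone, so the instant is ambiguous")
    # A Decimal is a valid amount, but without a currency its value is ambiguous.
    if "currency" not in record:
        problems.append("total has no currency, so the amount is ambiguous")
    return problems


# Both records satisfy the schema: placed_at is a datetime, total is a Decimal.
naive = {"placed_at": datetime(2023, 8, 18, 9, 0), "total": Decimal("120.00")}
explicit = {
    "placed_at": datetime(2023, 8, 18, 9, 0, tzinfo=timezone.utc),
    "total": Decimal("120.00"),
    "currency": "AUD",
}

print(semantic_problems(naive))
print(semantic_problems(explicit))
```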
Minimum Viable Data Contract Tooling
Andrew Jones - Driving Data Quality with Data Contracts (2023)
Meta-Data Powered Tooling
Andrew Jones - Driving Data Quality with Data Contracts (2023)
Data Quality Checks
Andrew Jones - Driving Data Quality with Data Contracts (2023)
Data Contract Tooling - My Context
(Diagram labels: Producer Boundaries; Semantics; Schema & SLOs; Checks and Tests)
OK, so how are we going to make this all happen?
Awesome humans who understand models, abstractions, constraints
You could even do it in ✨code✨
... and you should definitely version control it
Why Code? Why not Text?
● Entanglement of meaning and representation
● Finding References instead of text matches
● Enforcement of structure
● Refactoring
● Testable constraints
● More options for document generation
○ Including JSON and yaml
Although... I’ve been having a blast using Logseq (a graph-like outliner) and I might be crazy enough to give that a go as an IDE for this
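A small sketch of two of the points above, enforcement of structure and document generation. The Event class and its fields are hypothetical and stand in for whatever a real meta schema would define.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Event:
    name: str
    producer: str
    fields: tuple[str, ...]

    def __post_init__(self) -> None:
        # Enforcement of structure: an event with no fields is rejected early.
        if not self.fields:
            raise ValueError(f"event {self.name!r} must declare at least one field")


# Hypothetical event definition, for illustration only.
ORDER_PLACED = Event(name="order_placed", producer="pos", fields=("order_id", "total"))

# Document generation: the same object can be rendered as JSON.
print(json.dumps(asdict(ORDER_PLACED), indent=2))
```

Because ORDER_PLACED is a named object rather than a string in a document, an editor or static analysis tool can find every reference to it, which a plain text search cannot guarantee.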
“Refactoring” Text
Expectation Reality
https://xkcd.com/208/
What was considered
● Scope & Allies
● Constraints & Guiding Principles
● People and Process Centric
● Contract Meta Schema
● Maximise Contribution Opportunities
Guiding Principles
● Primary Objective: Consensus
● Evolution
● Quick Feedback
● First Outcome: Data Tests
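As an illustration of "First Outcome: Data Tests", the sketch below derives row-level checks from a contract fragment. The contract layout (field name mapped to expected type and nullability) is an assumption made for this example, not the speaker's meta model; in practice the rows might come from a Pandas DataFrame and the checks might run under Pytest.

```python
from typing import Callable

Row = dict
Check = Callable[[Row], bool]

# Hypothetical contract fragment: field name -> (expected type, nullable)
CONTRACT = {
    "order_id": (str, False),
    "discount_code": (str, True),
    "quantity": (int, False),
}


def build_checks(contract: dict) -> dict[str, Check]:
    """Turn each field definition into a check that a row must pass."""
    def make(name: str, expected: type, nullable: bool) -> Check:
        def check(row: Row) -> bool:
            value = row.get(name)
            if value is None:
                return nullable
            return isinstance(value, expected)
        return check
    return {name: make(name, t, n) for name, (t, n) in contract.items()}


def failures(row: Row, checks: dict[str, Check]) -> list[str]:
    """Names of the fields in this row that break the contract."""
    return [name for name, check in checks.items() if not check(row)]


checks = build_checks(CONTRACT)
good_row = {"order_id": "A1", "discount_code": None, "quantity": 2}
bad_row = {"order_id": None, "quantity": "2"}
print(failures(good_row, checks))
print(failures(bad_row, checks))
```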
Creating a Meta Model
● Focused around Events
● From UI to DB
● Schema and Semantics
● People
... still figuring it out
Don’t have to do it all at once!
The optimistic path to capturing and generating contracts
The Event Capture spreadsheet
Who’s Going to Do The Work?
Andrew Jones - Driving Data Quality with Data Contracts (2023)
Probably these people
Hopefully these people
Why Python?
● Gradual Typing*
● Static Analysis
● Well understood within the team
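A brief sketch of what gradual typing plus static analysis can offer here. The identifier types and functions are hypothetical; the point is that typed and untyped code can coexist, and that a static checker such as Mypy can tell apart two values that are both plain strings at runtime.

```python
from typing import NewType

# Two identifiers that are both strings in the schema but mean different things.
OrderId = NewType("OrderId", str)
CustomerId = NewType("CustomerId", str)


def describe_order(order_id: OrderId) -> str:
    return f"order {order_id}"


def legacy_helper(value):
    # Untyped code keeps working; annotations can be added later.
    return str(value).strip()


print(describe_order(OrderId(legacy_helper(" A1 "))))

customer = CustomerId("C9")
# describe_order(customer)  # a static checker such as Mypy should reject this call
```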
Helpful Python Libraries
● Pandas
● Pydantic
● Rope
● Pytest
● Mypy
● Black
Refactoring, doing variable extraction with Rope
https://colab.research.google.com/drive/1fHLit3hF2G0dFV0Xl11jnovcdPR87s-E
Code Refactoring - Other Libraries
• https://pybowler.io/ - has no variable extraction and has seen little development activity recently
• https://github.com/hchasestevens/astpath - useful for finding parts of the AST, but I'm not sure how to proceed from there; it does seem to power a number of meta-programming libraries
• traad - https://av.tib.eu/en/media/19947
Further explorations for wrangling generated code
• Abstract Syntax Tree - Options for querying
• Linting - Define my own rules as they apply to the meta schema
• Code duplication detection
• Network (Graph) Analysis
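As one possible starting point for querying the Abstract Syntax Tree, the sketch below uses only the standard-library ast module. The generated source and the names Field and Contract are hypothetical. It finds places where a variable is read, ignoring assignments and string literals that merely contain the same text.

```python
import ast

# Hypothetical generated code; it is only parsed here, never executed.
GENERATED = '''
order_total = Field("order_total", "decimal")
refund_total = Field("refund_total", "decimal")
contract = Contract(fields=[order_total, refund_total, order_total])
'''


def find_references(source: str, name: str) -> list[int]:
    """Line numbers where `name` is read (not assigned) in the source."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Name)
        and node.id == name
        and isinstance(node.ctx, ast.Load)
    )


# A text search counts the assignment and the string literal as well.
print(GENERATED.count("order_total"))
print(find_references(GENERATED, "order_total"))
```

The same walk could feed a custom lint rule or a graph of which definitions depend on which, both of which are listed above as further explorations.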
Key Takeaways
• You can be a Data Producer without knowing about it; make it worthwhile for Consumers to “register” with you
• You can do this through having a contract which provides clarity and can be used to power tooling and generate artefacts
• Code is easier to refactor, find references in, and generally maintain than the alternatives
My References
• Andrew Jones - Driving Data Quality with Data Contracts (2023) - ISBN 13 978-1837635009
• Data Contracts: The Key to Scaling Distributed Data Architecture and Reducing Data Chaos - https://atlan.com/data-contracts/
• Chad Sanderson - The Production-Grade Data Pipeline - https://dataproducts.substack.com/p/the-production-grade-data-pipeline
• Chad Sanderson and Adrian Kreuziger - An Engineer's Guide to Data Contracts - https://mlops.community/an-engineers-guide-to-data-contracts-pt-1/
• Green Tree Snakes - the missing Python AST docs - https://greentreesnakes.readthedocs.io/en/latest/
• Rope - Refactoring Variable Extraction - https://rope.readthedocs.io/en/latest/library.html#performing-refactorings
Questions?
linkedin.com/in/ryancollingwood
mastodon.social/@ryancollingwood
twitter.com/ryancollingwood
www.meetup.com/en-AU/data-engineering-melbourne