Do compilers look anything like data pipelines? How do you test your data to ensure end-to-end provenance and to enforce engineering guarantees for your data products? And what baby steps should you consider when assembling your team?
13. Each Data Team Writes their Own “Compiler”
A compiler’s stages map onto a data pipeline (sketched below):
● Lexical and Syntactic Analysis ↔ data quality, metadata, raw or slightly modelled data
● Semantic Analysis ↔ business understanding, model creation and experiments
● Code Generation
● Optimisation ↔ further tests, robustness
https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4
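A minimal sketch of the analogy above. The stage pairings follow the slide; the function names and the toy "amount" column are illustrative assumptions, not a real pipeline:

```python
import pandas as pd

def lexical_syntactic_analysis(raw: pd.DataFrame) -> pd.DataFrame:
    # Compiler: tokenise and parse. Pipeline: data quality and metadata
    # checks on raw or slightly modelled data.
    assert not raw.empty, "no input data"
    return raw.dropna(how="all")  # drop rows that fail to "parse" at all

def semantic_analysis(df: pd.DataFrame) -> pd.DataFrame:
    # Compiler: type checking. Pipeline: business understanding, model
    # creation and experiments.
    out = df.copy()
    out["amount"] = out["amount"].astype(float)  # enforce the "type system"
    return out

def code_generation(df: pd.DataFrame) -> dict:
    # Compiler: emit executable code. Pipeline: train the model, the
    # executable artefact built from the data.
    return {"mean_amount": df["amount"].mean()}  # stand-in for a real model

def optimisation(model: dict) -> dict:
    # Compiler: optimisation passes. Pipeline: further tests, robustness.
    assert model["mean_amount"] == model["mean_amount"], "NaN model output"
    return model

raw = pd.DataFrame({"amount": ["1.5", "2.0", None]})
model = optimisation(code_generation(semantic_analysis(lexical_syntactic_analysis(raw))))
```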
27. New Feature Engineering
“Instead of deriving the math before feeding the model, we ensure our features comply with certain properties so that the NN can do the math effectively by itself” -- Airbnb
https://arxiv.org/pdf/1810.09591.pdf
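One way to read the quote: instead of hand-deriving ratios and interactions, transform each raw feature until it satisfies properties the network handles well. A hedged sketch of that idea; the specific transforms and thresholds are my assumptions, not Airbnb's exact recipe:

```python
import numpy as np

def comply(feature: np.ndarray) -> np.ndarray:
    x = feature.astype(float)
    if x.min() >= 0 and x.max() > 10 * np.median(x):  # heavy right tail?
        x = np.log1p(x)                               # squash it smoothly
    return (x - x.mean()) / (x.std() + 1e-9)          # zero mean, unit variance

prices = np.array([10, 12, 15, 11, 900])  # heavy-tailed raw feature
nn_input = comply(prices)                 # smooth, centred input for the NN
```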
33. What Kind of Tests? (tests 1-3 are sketched below)
1. data validator: schema conformance and evolution
a. also a way to document new features used in the pipeline
b. also surfaces trends and anomalies in the data
2. data analyser: basic statistics
a. bias
b. feature/distribution skew
c. ...
3. model unit tester: looks for errors in the training code using synthetic data (schema-led fuzzing)
4. monitoring tests: check the output of the model to trigger alerts
https://blog.acolyer.org/2019/06/05/data-validation-for-machine-learning/
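A minimal, self-contained sketch of tests 1-3. The toy schema format and function names are my assumptions; the paper behind the link describes Google's TFX data-validation system, which is far richer:

```python
import random
import pandas as pd

SCHEMA = {"age": ("int", 0, 120), "country": ("str", None, None)}

def validate(df: pd.DataFrame) -> list[str]:
    # 1. data validator: every column conforms to the schema; new columns
    # are flagged, so schema evolution is documented rather than silent.
    errors = [f"unexpected feature: {c}" for c in df.columns if c not in SCHEMA]
    for col, (kind, lo, hi) in SCHEMA.items():
        if col not in df.columns:
            errors.append(f"missing feature: {col}")
        elif kind == "int" and not ((df[col] >= lo) & (df[col] <= hi)).all():
            errors.append(f"{col} out of range [{lo}, {hi}]")
    return errors

def analyse(train: pd.Series, serve: pd.Series, tol: float = 0.25) -> list[str]:
    # 2. data analyser: basic statistics; alert on train/serve skew.
    drift = abs(train.mean() - serve.mean()) / (abs(train.mean()) + 1e-9)
    return [f"distribution skew: {drift:.2f}"] if drift > tol else []

def fuzz(n: int = 100) -> pd.DataFrame:
    # 3. model unit tester: synthesise schema-conformant rows and run the
    # training code on them to surface crashes before real data arrives.
    return pd.DataFrame({
        "age": [random.randint(0, 120) for _ in range(n)],
        "country": [random.choice(["DE", "US", "BR"]) for _ in range(n)],
    })

df = fuzz()
assert validate(df) == []                   # synthetic data conforms to the schema
print(analyse(df["age"], df["age"] + 100))  # flags train/serve skew
```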
34. Monitoring
Model performance dashboard (sketched below):
1. model output metrics through training, validation, testing, and deployment
2. data input metrics
3. operational telemetry
Image from: https://www.parallelm.com/
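An illustrative sketch of wiring those three metric families into alert triggers; the metric names and thresholds are assumptions, not taken from any particular monitoring product:

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    auc: float             # 1. model output metric (training .. deployment)
    null_rate: float       # 2. data input metric
    p99_latency_ms: float  # 3. operational telemetry

def alerts(now: Snapshot, baseline: Snapshot) -> list[str]:
    out = []
    if now.auc < baseline.auc - 0.05:
        out.append("model output degraded")
    if now.null_rate > 2 * baseline.null_rate:
        out.append("input data anomaly")
    if now.p99_latency_ms > 1.5 * baseline.p99_latency_ms:
        out.append("serving latency regression")
    return out

print(alerts(Snapshot(0.71, 0.09, 180), Snapshot(0.79, 0.02, 90)))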
38. What is Special about ML?
1. New Artefacts to Manage
a. Data
b. Metadata: Hyperparameters
c. Code: architecture
d. Model: executable software “built from the data”
e. Experiment: Data + Metadata + Hyperparameters + Code -> Model (sketched below)
2. Different Process
a. Trial and error: Scientific method
b. Reproducibility - traceability
c. Explainability
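Point 1e can be made concrete as a record type that pins every artefact an experiment depends on; the field names, values, and storage URI below are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Experiment:
    data_version: str   # a. Data (e.g. a dataset hash or version tag)
    hyperparams: dict   # b. Metadata: hyperparameters
    code_revision: str  # c. Code: architecture, pinned at a git commit
    model_uri: str      # d. Model: the executable built from the data

exp = Experiment(
    data_version="sha256:4f2a9c01",       # hypothetical dataset hash
    hyperparams={"lr": 0.01, "epochs": 20},
    code_revision="git:ab12cd3",
    model_uri="s3://models/churn/v7",     # hypothetical storage location
)
# Reproducibility: re-running the same (data, metadata, code) should
# recreate the same model; traceability: this record links them forever.
```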
41. The Data Version Control Tipping Point
● Datasets can be versioned, branched, and acted upon by versioned code to create new datasets
● Test and file bugs against data
● Enable quality control for compiler steps
● Automated lineage and schema-change detection (sketched below)
● Make guarantees about system components
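A minimal sketch of two mechanics these bullets rely on: content-addressed dataset versions and automated schema-change detection between them. This is an illustration of the idea, not how any specific data-versioning tool actually implements it:

```python
import hashlib
import json
import pandas as pd

def dataset_version(df: pd.DataFrame) -> str:
    # Hash contents + schema so any change yields a new, traceable version.
    payload = df.to_csv(index=False).encode() + json.dumps(
        {c: str(t) for c, t in df.dtypes.items()}).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

def schema_diff(old: pd.DataFrame, new: pd.DataFrame) -> list[str]:
    # Detect added/removed columns: the "lineage and schema change" alert.
    added = set(new.columns) - set(old.columns)
    removed = set(old.columns) - set(new.columns)
    return [f"+{c}" for c in sorted(added)] + [f"-{c}" for c in sorted(removed)]

v1 = pd.DataFrame({"id": [1], "amount": [9.5]})
v2 = v1.assign(currency="EUR")
print(dataset_version(v1), "->", dataset_version(v2), schema_diff(v1, v2))
```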
47. Ad Hoc Exploration (stage 1)
Processes: isolated efforts; no repeatability
Tools: siloed data; local dev
Relationships: no business buy-in (purely transactional); ivory tower
48. Reproducible, but Limited (stage 2)
Processes: repeatability is patchy; poor governance
Tools: shy centralisation; static reports
Relationships: heavily transactional rapport; team-management support only
49. Defined, Controlled (stage 3)
Processes: formal but manually enforced; incipient experimentation
Tools: good centralisation (metadata, access); live retrospective reports
Relationships: empathy